Project description: This project’s purpose is to predict bike traffic from weather data. To this end, we use freely available data on bike traffic collected at 5 bike counting stations in the district of Rhein-Kreis Neuss in North Rhine-Westphalia, Germany, and merge it with equally freely available data from the Deutscher Wetterdienst (DWD) for Düsseldorf, which is geographically the closest weather station to the area where the bike traffic was counted (just on the opposite side of the river Rhine).
We only use data for the dates on which all five counting stations reported some traffic, and we aggregate their counts, so that we do not make separate predictions per station but rather predict the ‘general traffic’ at the corresponding point in time.
The explanatory variables are in most cases available at an hourly resolution, and this is the level at which our regression takes place. Some are only available on a daily basis – for simplicity, we assume them to be constant throughout the day – and we additionally use dummies implied by the date (Hour, Month, Weekday) as well as dummies for national holidays and school holidays.
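As a minimal sketch, the date-implied dummies can be derived as follows (the data frame `df` and its columns are illustrative names, not the project’s actual objects):

```r
# Hypothetical sketch of the date-implied dummy variables
df <- data.frame(Date_Hour = as.POSIXct(c("2016-04-05 07:00:00",
                                          "2016-04-09 17:00:00")))
df$Hour    <- as.factor(format(df$Date_Hour, "%H"))              # "00".."23"
df$Month   <- as.factor(format(df$Date_Hour, "%m"))              # "01".."12"
df$Weekday <- as.factor(weekdays(df$Date_Hour, abbreviate = TRUE))
# In lm(), wrapping such variables in as.factor() expands them into dummy columns.
```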
We warmly recommend consulting our variable list:
data/raw/VariableList.xlsx. It provides you with
information on the DWD abbreviations, our assigned variable names, the
format before and after final processing, measuring units and, if
necessary, an explanation.
The models are trained using cross-validation. At the end, each model’s CV error is compared to its training error; the goal is to keep these close together, indicating that the model is not overfitting. The metric used for this is the MSE, since it is the default in some of the packages we rely on (e.g., caret when identifying the out-of-bag error).
We have so much data that we can also afford an 80-20 train-test split, to obtain performance metrics on fully new data. These metrics – MAE, RMSE and R² – are all summarised at the end.
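For reference, the three hold-out metrics can be computed by hand from a vector of actuals and a vector of predictions (the numbers below are made up; in the project they come from the 20% test split):

```r
# Toy illustration of the hold-out metrics (vectors are invented numbers)
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)
mae  <- mean(abs(actual - predicted))                                    # MAE
rmse <- sqrt(mean((actual - predicted)^2))                               # RMSE
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2) # R-squared
```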
The data sources can be inspected at:
<project directory>/data/raw/sources.ods. The script
contains web scraping, meaning that some information is always
retrieved live when you run the code.
NOTE: If you do not want to reproduce the
whole data preparation process with all its steps, you have to
ensure that the process_raw_data variable is set to
FALSE; you can change it in the setup. Then, you need
to comment out the if(){} conditions in the data preparation
chunks at approx. lines 115 and 807 (technical
note: these had to be commented out so that the output for the html
rendering appears throughout the code and not only at the end of the
if block). Then only the preprocessed final data – which is available in
data/final/ – will be loaded, reducing the running time by
approx. 5 min.
Ensure process_raw_data exists:
if(!exists('process_raw_data')){
process_raw_data <- TRUE
load_final_data <- !process_raw_data
}
# If data preparation not desired, just load the already saved data.
if(load_final_data) {
data_final_aggr <- readRDS(paste(wd, "data/final/data_final_bike_weather.RDS", sep = "/"))
}
Extensive data preparation:
# if (process_raw_data) {
# ==============================================================================
# Bike traffic data
# ==============================================================================
# Unzip the zip files
unzip(paste(wd, "data/raw/eco-counter-data.zip", sep = "/"), exdir = "bikedata_dir")
bike_data <- read.csv("bikedata_dir/eco-counter-data.csv", sep = ";")
head(bike_data$Datum)
## [1] "2016-04-04T23:00:00+00:00" "2016-04-05T07:00:00+00:00"
## [3] "2016-04-05T09:00:00+00:00" "2016-04-05T14:00:00+00:00"
## [5] "2016-04-05T17:00:00+00:00" "2016-04-05T18:00:00+00:00"
bike_data$Date <- as.Date(substr(bike_data$Datum, 1, 10))
bike_data$Hour <- as.double(substr(bike_data$Datum, 12, 13))
# Junk information: No variation, useless
unique(substr(bike_data$Datum, 11, 11))
## [1] "T"
unique(substr(bike_data$Datum, 14, 26))
## [1] ":00:00+00:00"
#bike_data$Datum <- NULL
# Machine-readable date
bike_data$Date_Hour <- as.POSIXct(paste(bike_data$Date, " ", bike_data$Hour, ":00", sep = ""), format = "%Y-%m-%d %H:%M")
bike_data$Weekday <- weekdays(bike_data$Date, abbreviate = TRUE)
# The data is still at 10-min frequency. Aggregate to an hourly basis.
hourly_info <- bike_data[!duplicated(paste(bike_data$Date_Hour, bike_data$Zählstelle)), - which(colnames(bike_data) == "Anzahl")]
hourly_number <- aggregate(data = bike_data, Anzahl ~ Date_Hour + Zählstelle, sum)
bike_data_aggr <- merge(hourly_info, hourly_number,
by = c("Date_Hour", "Zählstelle"))
aggregate(data = bike_data_aggr, Date_Hour ~ Zählstelle, min)
## Zählstelle Date_Hour
## 1 Dormagen 2015-11-16 23:00:00
## 2 Grevenbroich 2015-12-07 23:00:00
## 3 Jüchen 2015-12-03 23:00:00
## 4 Meerbusch 2016-01-25 23:00:00
## 5 Neuss 2015-12-10 23:00:00
aggregate(data = bike_data_aggr, Date_Hour ~ Zählstelle, max)
## Zählstelle Date_Hour
## 1 Dormagen 2025-04-28 21:00:00
## 2 Grevenbroich 2025-04-03 21:00:00
## 3 Jüchen 2024-11-14 22:00:00
## 4 Meerbusch 2025-04-28 21:00:00
## 5 Neuss 2025-04-28 21:00:00
table(bike_data_aggr$Zählstelle)
##
## Dormagen Grevenbroich Jüchen Meerbusch Neuss
## 76799 80171 74780 76174 70496
nrow(bike_data) / nrow(bike_data_aggr) # with full data it should be 6.
## [1] 4.748338
# The stations' differing availability periods explain the incomplete data over the whole time span
sum(is.na(bike_data$Anzahl))
## [1] 129423
sum(is.na(bike_data_aggr$Anzahl)) # we got rid of all NAs during merging.
## [1] 0
# Plausibility check:
# Daily aggregation
plot(aggregate(data = bike_data_aggr, Anzahl ~ Hour, sum)) # Aggregation is probably
# correct when night time shows fewer cyclists
# Weekday aggregation
plot(aggregate(data = bike_data_aggr, Anzahl ~ as.factor(Weekday), sum),
main = 'Weekdaily aggregated bike traffic')
# Monthly aggregation
plot(aggregate(data = bike_data_aggr, Anzahl ~ format(substr(Date, 6,7), format = "%m"), sum),
main = 'Monthly aggregated bike traffic')
# Days of the year aggregation
plot(aggregate(data = bike_data_aggr, Anzahl ~ as.Date(substr(Date, 6,10), format = "%m-%d"), sum),
main = 'By calendar day: aggregated bike traffic')
# SIMPLIFICATIONS
colnames(bike_data_aggr)
## [1] "Date_Hour" "Zählstelle" "Id" "Datum" "Status"
## [6] "Channel.Id" "Koordinaten" "Date" "Hour" "Weekday"
## [11] "Anzahl"
unique(paste(bike_data_aggr$Zählstelle, bike_data_aggr$Id))
## [1] "Dormagen 100019716" "Jüchen 100019719" "Grevenbroich 100019718"
## [4] "Neuss 100019717" "Meerbusch 100019715"
bike_data_aggr$Id <- NULL # ID not needed
bike_data_aggr$Koordinaten <- NULL # not needed
bike_data_aggr$Datum <- NULL # date has been transformed to useful
unique(bike_data_aggr$Status)
## [1] "raw" "modified" ""
bike_data_aggr$Status <- NULL # we will not examine quality of measurement
unique(bike_data_aggr$Channel.Id) # not useful for our purpose
## [1] 104019716 102019716 103019716 101019716 105019716 106019716 104019719
## [8] 102019719 103019719 101019719 104019718 101019718 103019718 102019718
## [15] 103019717 104019717 101019717 102019717 103019715 102019715 104019715
## [22] 101019715
bike_data_aggr$Channel.Id <- NULL
# ==============================================================================
# Meteorological hourly data - for Düsseldorf (very close to the places that are relevant for us)
# ==============================================================================
# What is interesting:
# Added manually after inspection of the Metadaten_Parameter_<...>_01078.html
# files which are to be found in the zip files in the data/raw directory,
# or -- after running the code -- in the unpacked zip folders.
var_of_int <- c(Date_Hour = "Date_Hour",
V_N = "cloud_cover",
V_TE002 = "tmp_soil",
DD = "wind_degree",
FF = "wind_speed",
FX_911 = "wind_highest_last_h",
TT = "air_temp",
R1 = "precip_h_mm",
RS_ind = "precip_h_indicator",
WRTR = "precip_type",
SD_SO = "sunshine_duration",
RF_STD = "rel_humidity",
P_STD = "air_pressure",
V_VV = "visibility",
WW = "observed_weather")
# List of files
filelist_meteo <- list.files(path = paste(wd, "data/raw", sep = "/"))
# Read in the hourly data, but NOT the verbal weather description -- it will not be of interest
filelist_meteo <- filelist_meteo[intersect(which(grepl("stundenwerte", filelist_meteo)),
which(!grepl("WW", filelist_meteo)))]
# Loop for unzipping data
for(i in 1:length(filelist_meteo)) {
tmp_dir <- paste("tmp/", substr(filelist_meteo[i], 1, 15), sep = "")
unzip_dir <- unzip(paste(wd, "data/raw", filelist_meteo[i], sep = "/"),
exdir = tmp_dir)
list_files_zip <- list.files(tmp_dir)
relevant_csv <- list_files_zip[which(grepl("produkt", list_files_zip))]
csv <- read.csv(paste(tmp_dir, relevant_csv, sep = "/"), sep = ";")
# Identify NAs, encoded as -999
csv[csv == -999] <- NA
# Construct uniform date
csv$Date_Hour <- as.POSIXct(paste(substr(csv$MESS_DATUM, 1, 4), "-",
substr(csv$MESS_DATUM, 5, 6), "-",
substr(csv$MESS_DATUM, 7, 8), " ",
substr(csv$MESS_DATUM, 9, 10), ":00",
sep = ""),
format = "%Y-%m-%d %H:%M")
# Only needed for certain period
csv <- csv[csv$Date_Hour >= "2015-11-16 23:00:00",]
csv <- csv[rowSums(is.na(csv)) < ncol(csv),]
# Save only what is relevant for final data, and check no double saving happens.
keep_cols <- if(!"csv_pass" %in% ls()){
which(colnames(csv) %in% names(var_of_int))
} else {
intersect(which(colnames(csv) %in% names(var_of_int)),
which(!var_of_int[colnames(csv)] %in% colnames(csv_pass)[-which(colnames(csv_pass) == "Date_Hour")]))
}
csv_final <- csv[, keep_cols]
# When no relevant var_of_interest in the table:
# No further processing
if(is.null(ncol(csv_final))) {next}
colnames(csv_final) <- var_of_int[colnames(csv_final)]
# If it does not exist -> initiate the data frame
# If it exists -> merge the new columns with the existing ones
if(i == 1) {
csv_pass <- csv_final
} else {
csv_pass <- merge(csv_final, csv_pass, all.x = TRUE, all.y = TRUE, by = "Date_Hour")
}
# Bring it together with the hourly bike data
weather_data_final <- merge(csv_pass, bike_data_aggr, by = "Date_Hour")
}
length(which(weather_data_final$wind_degree == 360))
## [1] 7302
colnames(weather_data_final)
## [1] "Date_Hour" "visibility" "air_pressure"
## [4] "rel_humidity" "air_temp" "sunshine_duration"
## [7] "precip_h_mm" "precip_type" "wind_highest_last_h"
## [10] "wind_speed" "wind_degree" "tmp_soil"
## [13] "cloud_cover" "Zählstelle" "Date"
## [16] "Hour" "Weekday" "Anzahl"
# Quick test of whether the data can already be put to meaningful use
first_lm <- lm(data = weather_data_final[weather_data_final$Zählstelle == "Meerbusch",], Anzahl ~ air_temp + as.factor(Hour) + as.factor(Weekday) + wind_speed)
summary(first_lm)
##
## Call:
## lm(formula = Anzahl ~ air_temp + as.factor(Hour) + as.factor(Weekday) +
## wind_speed, data = weather_data_final[weather_data_final$Zählstelle ==
## "Meerbusch", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -92.17 -15.77 -2.99 10.56 430.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.93149 0.66844 -25.330 < 2e-16 ***
## air_temp 1.99640 0.01639 121.787 < 2e-16 ***
## as.factor(Hour)1 0.46875 0.78555 0.597 0.550702
## as.factor(Hour)2 1.05772 0.78623 1.345 0.178533
## as.factor(Hour)3 3.18931 0.78571 4.059 4.93e-05 ***
## as.factor(Hour)4 9.51088 0.78567 12.105 < 2e-16 ***
## as.factor(Hour)5 21.04927 0.78531 26.804 < 2e-16 ***
## as.factor(Hour)6 26.01314 0.78533 33.124 < 2e-16 ***
## as.factor(Hour)7 22.99559 0.78580 29.264 < 2e-16 ***
## as.factor(Hour)8 24.83627 0.78674 31.569 < 2e-16 ***
## as.factor(Hour)9 32.87642 0.78810 41.716 < 2e-16 ***
## as.factor(Hour)10 39.17158 0.79009 49.578 < 2e-16 ***
## as.factor(Hour)11 41.86622 0.79212 52.853 < 2e-16 ***
## as.factor(Hour)12 44.80531 0.79372 56.450 < 2e-16 ***
## as.factor(Hour)13 48.14569 0.79422 60.620 < 2e-16 ***
## as.factor(Hour)14 50.55882 0.79449 63.637 < 2e-16 ***
## as.factor(Hour)15 50.11603 0.79375 63.138 < 2e-16 ***
## as.factor(Hour)16 43.56313 0.79215 54.994 < 2e-16 ***
## as.factor(Hour)17 28.60312 0.79045 36.186 < 2e-16 ***
## as.factor(Hour)18 13.76862 0.78843 17.463 < 2e-16 ***
## as.factor(Hour)19 4.78845 0.78678 6.086 1.16e-09 ***
## as.factor(Hour)20 0.52952 0.78584 0.674 0.500425
## as.factor(Hour)21 -0.55737 0.78535 -0.710 0.477886
## as.factor(Hour)22 -0.51252 0.78525 -0.653 0.513965
## as.factor(Hour)23 -0.39521 0.78528 -0.503 0.614772
## as.factor(Weekday)Mon 1.50457 0.42404 3.548 0.000388 ***
## as.factor(Weekday)Sat 7.13426 0.42359 16.842 < 2e-16 ***
## as.factor(Weekday)Sun 18.03232 0.42393 42.536 < 2e-16 ***
## as.factor(Weekday)Thu 1.80446 0.42359 4.260 2.05e-05 ***
## as.factor(Weekday)Tue 1.53088 0.42315 3.618 0.000297 ***
## as.factor(Weekday)Wed 2.88904 0.42324 6.826 8.81e-12 ***
## wind_speed -2.21261 0.05039 -43.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.05 on 65786 degrees of freedom
## (126 observations deleted due to missingness)
## Multiple R-squared: 0.4552, Adjusted R-squared: 0.455
## F-statistic: 1773 on 31 and 65786 DF, p-value: < 2.2e-16
unique(weather_data_final$Zählstelle)
## [1] "Dormagen" "Jüchen" "Grevenbroich" "Neuss" "Meerbusch"
# Which is the busiest station?
aggregate(weather_data_final, Anzahl ~ Zählstelle, sum)
## Zählstelle Anzahl
## 1 Dormagen 346425
## 2 Grevenbroich 388756
## 3 Jüchen 609886
## 4 Meerbusch 1606697
## 5 Neuss 1059233
# Strong wind -> correlated with some direction?
plot(aggregate(weather_data_final, wind_highest_last_h ~ wind_degree, mean),
main = 'Relation between wind direction and wind speed')
colSums(is.na(weather_data_final))
## Date_Hour visibility air_pressure rel_humidity
## 0 290 922 922
## air_temp sunshine_duration precip_h_mm precip_type
## 257 85525 426 94200
## wind_highest_last_h wind_speed wind_degree tmp_soil
## 1008 360 360 338403
## cloud_cover Zählstelle Date Hour
## 313 0 0 0
## Weekday Anzahl
## 0 0
# Look whether the NAs in sunshine duration come from reporting NA instead of
# 0 at night
length(intersect(which(is.na(weather_data_final$sunshine_duration)),
which(as.numeric(substr(weather_data_final$Date_Hour, 12, 13)) %in% c(21, 22, 23, 0, 1, 2)) ))
## [1] 70448
# When data is not available and it is at night, then assume 0min of sunshine
weather_data_final$sunshine_duration[intersect(which(is.na(weather_data_final$sunshine_duration)),
which(as.numeric(substr(weather_data_final$Date_Hour, 12, 13)) %in% c(21, 22, 23, 0, 1, 2)) )] <- 0
# Plausibility checks
# Yearly mean temperatures
plot(aggregate(weather_data_final, air_temp ~ (substr(Date_Hour, 1, 4)), mean), type = "l",
main = 'Mean temperatures by year')
# Mean temperature at the days of the year
plot(aggregate(weather_data_final, air_temp ~ as.Date(substr(Date_Hour, 6, 10), format = "%m-%d"), mean), type = "l",
main = 'Mean temperature per calendar days')
# Bike traffic at calendar dates
plot(aggregate(bike_data_aggr, Anzahl ~ as.Date(substr(Date_Hour, 6, 10), format = "%m-%d"), sum),
main = 'Aggregated bike traffic at calendar dates')
# Assign an English column name for the number of bikes
colnames(weather_data_final) <- gsub("Anzahl", "Bikes_Counted", colnames(weather_data_final))
# ==============================================================================
# More detailed meteorological data on daily aggregation
# ==============================================================================
# More variables are available if we look not only at an hourly, but also
# at a daily aggregation basis.
# csv for structure inspection
# https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/formate_kl.html
# Can be obtained from:
data_subdaily <- read.csv(file = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/kl_10400_00_akt.txt")[1,]
# Visually, this is a mess!
# Automatic column assignment:
# Use the html meta-file for quicker data identification -- the position within
# the string is required, and the structure of the html file allows for quick
# identification of these positions for the relevant variables
library(rvest)
## Warning: package 'rvest' was built under R version 4.4.3
url <- "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/formate_kl.html"
page <- read_html(url)
table <- page |> html_table(fill = TRUE)
table_html <- table[[1]]
colnames(table_html) <- table_html[1,]
table_html <- table_html[-1,]
table_html <- data.frame(Label = table_html$Label,
start_pos = as.numeric(table_html$Pos))
# Find out difference between starting points of variable entries
width_of_col <- diff(table_html$start_pos)
# Define widths of the separated variable entry 'columns'
width_of_col <- c(width_of_col, (nchar(data_subdaily) - sum(width_of_col)))
# This is fixed-width format (fwf) data:
# one long string per row, which can only be split up by providing the column widths.
# We know the widths now, as well as the relevant variables.
data_subdaily <- read.fwf("https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/kl_10393_00_akt.txt",
widths = width_of_col, stringsAsFactors = FALSE,
col.names = table_html$Label)
#data_subdaily[1, 1:10]
# Create machine-readable dates from the date columns
data_subdaily$Date_Year <- as.Date(paste(data_subdaily$JA,
sprintf("%02d", data_subdaily$MO), # paste as 2digit
sprintf("%02d", data_subdaily$TA),
sep = "-"))
# Filter only relevant time period
data_subdaily <- data_subdaily[data_subdaily$Date_Year >= "2015-11-16",]
# Select only relevant columns -> cf. https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/formate_kl.html
subdaily_var_of_int <- c(Date_Year = "Date_Year",
SDK = "sunshine_dur_day",
E1 = "soilstate_7am",
VAK = "type_precip",
SHK = "snow_cover_cm")
# Find which are available AND of interest, only keep these
data_subdaily <- data_subdaily[, which(colnames(data_subdaily) %in% names(subdaily_var_of_int))]
# Assign human-interpretable colnames
colnames(data_subdaily) <- subdaily_var_of_int[colnames(data_subdaily)]
# Identify NAs -> here encoded as -99
data_subdaily[data_subdaily == -99] <- NA
# Plausibility analysis: Snow coverage by calendar date
plot(aggregate(data_subdaily, snow_cover_cm ~ as.Date(substr(Date_Year, 6, 10), format = "%m-%d"), mean), type = "l",
main = 'Plausibility check for snow coverage by calendar date')
# Prepare for hourly merging -> clone each row 24 times and create artificial
# time column
data_subdaily <- data_subdaily[rep(1:nrow(data_subdaily), each = 24),]
data_subdaily$Hour <- rep(sprintf("%02d", 0:23), times = (nrow(data_subdaily) / 24))
data_subdaily$Date_Hour <- paste(data_subdaily$Date_Year, " ", data_subdaily$Hour, ":00:00", sep = "") |> as.POSIXct(format = "%Y-%m-%d %H:%M")
data_subdaily$Date_Year <- NULL; data_subdaily$Hour <- NULL
# Merge with final data
weather_data_PRELIM <- merge(weather_data_final, data_subdaily, by = "Date_Hour", all.x = TRUE)
# CHECK whether the new df WITHOUT the daily columns corresponds exactly
# to the previous df, to rule out merging failures
identical(weather_data_final[, intersect(colnames(weather_data_final), colnames(weather_data_PRELIM))],
weather_data_PRELIM[, intersect(colnames(weather_data_final), colnames(weather_data_PRELIM))])
## [1] TRUE
# Delete tmp version
weather_data_final <- weather_data_PRELIM; weather_data_PRELIM <- NULL
# =========================================
# Holiday data
# =========================================
# Use webscraping!
# Initiate: Look for years, prepare empty vector
years2check <- unique(substr(weather_data_final$Date_Hour, 1, 4))
holidays <- c()
for(i in 1:length(years2check)) {
# Make use of structured urls
address2check <- paste("https://www.schulferien.org/Kalender_mit_Ferien/kalender_", years2check[i], "_ferien_Nordrhein_Westfalen.html", sep = "")
page <- read_html(address2check)
text <- page |> html_nodes(".feiertag_liste_row") |> html_text()
# holidays are always stored in that node!
# for compatibility, translation of Mär to Mrz is needed
# Make use of the unique structure and retrieve day, month, year
# NOTE: as.Date() has no 'locale' argument; parsing German month abbreviations
# relies on the system's LC_TIME locale (cf. Sys.setlocale("LC_TIME", "de_DE"))
dates <- paste(gsub("Mär", "Mrz", substr(text, 16, 22)), substr(text, 38, 41), sep = " ") |> as.Date(format = "%d. %b %Y")
# add to holiday vector
holidays <- c(holidays, dates)
}
# Re-convert from numeric to actual date
holidays <- holidays |> as.Date(origin = "1970-01-01")
# Do the same for school holidays
school_holidays <- c()
for(i in 1:length(years2check)) {
address2check <- paste("https://www.schulferien.org/Kalender_mit_Ferien/kalender_", years2check[i], "_ferien_Nordrhein_Westfalen.html", sep = "")
page <- read_html(address2check)
# holiday data most easily retrieved like this
text_raw <- page |> html_nodes("td") |> html_attr("data-tip-text")
text_raw <- text_raw[-which(is.na(text_raw))]
# Delete superfluous text information
text_raw <- gsub("</span><br>Nordrhein-Westfalen <br>", " ", gsub("<span class=\"tmv_tooltip_title\">", "", text_raw))
# Look for structured information which indicates a timespan between two dates
school_holidays_year <- unlist(regmatches(text_raw, gregexpr(text_raw, pattern = "\\d{2}\\.\\d{2}\\.\\d{4} - \\d{2}\\.\\d{2}\\.\\d{4}"))) |> unique()
hol_in_y <- c()
# Only the beginning and end dates are known so far; we want all days in between as well
for(j in 1:length(school_holidays_year)) {
obj <- school_holidays_year[j]
date1 <- as.Date(substr(obj, 1, 10), format = "%d.%m.%Y")
date2 <- as.Date(substr(obj, 14, nchar(obj)), format = "%d.%m.%Y")
holiday_seq <- date1:date2
hol_in_y <- c(hol_in_y, holiday_seq)
}
# Store values in the array
school_holidays <- c(school_holidays, hol_in_y)
}
# Again, a machine-readable date is needed.
school_holidays <- school_holidays |> as.Date(origin = "1970-01-01")
# Add to the final data set: dummies are most useful!
# Check for each date whether it is in the holidays or school_holidays array
weather_data_final$Holiday <- ifelse(weather_data_final$Date %in% holidays, 1, 0)
# Plausibility check
table(weather_data_final$Holiday)
##
## 0 1
## 333891 4512
weather_data_final$School_holidays <- ifelse(weather_data_final$Date %in% school_holidays, 1, 0)
# Plausibility check
table(weather_data_final$School_holidays)
##
## 0 1
## 258535 79868
# Final cleaning
# Which columns have more than 5% NA still?
which(colSums(is.na(weather_data_final)) > 0.05 * nrow(weather_data_final))
## precip_type tmp_soil snow_cover_cm
## 8 12 22
# DELETE
unique(weather_data_final$tmp_soil)
## [1] NA
weather_data_final$tmp_soil <- NULL
head(weather_data_final$precip_type, 20)
## [1] 6 NA 6 6 NA 6 6 NA 6 6 NA 6 0 NA 0 6 NA 0 0 NA
# Every third element seems to be missing. Check that:
aggregate(is.na(weather_data_final$precip_type), by = list(rep(1:3, length.out = nrow(weather_data_final)),
weather_data_final$Zählstelle),
FUN = sum)
## Group.1 Group.2 x
## 1 1 Dormagen 6660
## 2 2 Dormagen 6443
## 3 3 Dormagen 6218
## 4 1 Grevenbroich 6417
## 5 2 Grevenbroich 6460
## 6 3 Grevenbroich 6251
## 7 1 Jüchen 6470
## 8 2 Jüchen 6409
## 9 3 Jüchen 6245
## 10 1 Meerbusch 5828
## 11 2 Meerbusch 6019
## 12 3 Meerbusch 5725
## 13 1 Neuss 6398
## 14 2 Neuss 6620
## 15 3 Neuss 6037
# Roughly a third of the values are missing.
sum(is.na(weather_data_final$precip_type)) / nrow(weather_data_final)
## [1] 0.2783663
# Why is that? There is a reason:
# cf. rmd/tmp/stundenwerte_RR/Metadaten_Parameter_rr_stunde_01078.html:
# not every hour is reported! Some are missing, e.g. 3, 6, 9, ...
# For simplicity: just assume the value from the previous hour. (This might not
# always be correct, but omitting roughly every third row would not be adequate
# in this setting.)
# "Forward filling"
for(k in 2:nrow(weather_data_final)) { # start at 2: a leading NA must not index row 0
if(is.na(weather_data_final$precip_type[k])) {
weather_data_final$precip_type[k] <- weather_data_final$precip_type[k - 1]
}
}
# Look how many still NA
sum(is.na(weather_data_final$precip_type))
## [1] 0
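The explicit loop above could also be replaced by a vectorized “last observation carried forward”, assuming the zoo package is available (it is not otherwise a dependency of this project; shown here on a toy vector):

```r
# Toy illustration: zoo::na.locf() carries the last non-NA value forward;
# na.rm = FALSE keeps any leading NAs instead of dropping them.
x <- c(6, NA, 6, 0, NA, 0)
zoo::na.locf(x, na.rm = FALSE)  # 6 6 6 0 0 0
```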
# Snow cover: missing
length(which(is.na(weather_data_final$snow_cover_cm)))
## [1] 35134
summer_and_no_snow <- intersect(which(is.na(weather_data_final$snow_cover_cm)),
which(as.double(substr(weather_data_final$Date, 6, 7)) %in% 4:10))
length(summer_and_no_snow)
## [1] 35038
# For those days which are in summer and have NA for snow coverage -> Assume 0
weather_data_final$snow_cover_cm[summer_and_no_snow] <- 0
(colSums(is.na(weather_data_final)) / nrow(weather_data_final) * 100) |> sort(decreasing = T)
## sunshine_duration wind_highest_last_h air_pressure rel_humidity
## 4.45533875 0.29786970 0.27245621 0.27245621
## precip_h_mm wind_speed wind_degree cloud_cover
## 0.12588541 0.10638204 0.10638204 0.09249327
## visibility air_temp snow_cover_cm Date_Hour
## 0.08569664 0.07594495 0.02836854 0.00000000
## precip_type Zählstelle Date Hour
## 0.00000000 0.00000000 0.00000000 0.00000000
## Weekday Bikes_Counted sunshine_dur_day soilstate_7am
## 0.00000000 0.00000000 0.00000000 0.00000000
## type_precip Holiday School_holidays
## 0.00000000 0.00000000 0.00000000
# After this: only a moderate share of NAs is left in every column (at most
# approx. 4.5%, for sunshine_duration); we neglect this
# Look whether dataset-specific NAs still exist that have not been accounted for,
# encoded with e.g. -9, -99 or -999 -> NAs could still be hidden as negative codes!
# Check for every column how many entries carry DWD NA codes
colSums(weather_data_final == -9 | weather_data_final == -99 | weather_data_final == -999) |> sort(decreasing = T)
## soilstate_7am type_precip Date_Hour precip_type
## 152587 152587 0 0
## Zählstelle Date Hour Weekday
## 0 0 0 0
## Bikes_Counted sunshine_dur_day Holiday School_holidays
## 0 0 0 0
# Check the discrete variable outcomes
table(weather_data_final$soilstate_7am)
##
## -9 0 1 2 4 5 7 8
## 152587 39000 112957 5325 17302 360 8688 2184
table(weather_data_final$type_precip)
##
## -9 0 1 2 3 5 6 7 8 10 15
## 152587 72223 92504 3959 1320 4920 2472 360 2208 2730 120
## 21 25 26 30
## 1128 696 480 696
# Not-reporting of soil state could be season-related
aggregate(weather_data_final$soilstate_7am == -9, by = list(substr(weather_data_final$Date, 6, 7)), FUN = sum)
## Group.1 x
## 1 01 10007
## 2 02 8736
## 3 03 11065
## 4 04 13550
## 5 05 14136
## 6 06 13680
## 7 07 14136
## 8 08 14083
## 9 09 13678
## 10 10 13628
## 11 11 13161
## 12 12 12727
# It is not! Maybe year...
aggregate(weather_data_final$soilstate_7am == -9, by = list(substr(weather_data_final$Date, 1, 4)), FUN = sum)
## Group.1 x
## 1 2015 0
## 2 2016 0
## 3 2017 0
## 4 2018 0
## 5 2019 0
## 6 2020 33856
## 7 2021 41817
## 8 2022 43790
## 9 2023 33124
# Visually:
plot(weather_data_final$Date_Hour, ifelse(weather_data_final$soilstate_7am == -9, 1, 0),
main = 'Check of data availability of soilstate')
# Apparently, soil state is not reported any more from 2020 on => what to do?
# SAME check for type_precip...
# Not-reporting could be season-related
aggregate(weather_data_final$type_precip == -9, by = list(substr(weather_data_final$Date, 6, 7)), FUN = sum)
## Group.1 x
## 1 01 10007
## 2 02 8736
## 3 03 11065
## 4 04 13550
## 5 05 14136
## 6 06 13680
## 7 07 14136
## 8 08 14083
## 9 09 13678
## 10 10 13628
## 11 11 13161
## 12 12 12727
# It is not! Maybe year...
aggregate(weather_data_final$type_precip == -9, by = list(substr(weather_data_final$Date, 1, 4)), FUN = sum)
## Group.1 x
## 1 2015 0
## 2 2016 0
## 3 2017 0
## 4 2018 0
## 5 2019 0
## 6 2020 33856
## 7 2021 41817
## 8 2022 43790
## 9 2023 33124
plot(weather_data_final$Date_Hour, ifelse(weather_data_final$type_precip == -9, 1, 0),
main = 'Check for data availability of precipitation type')
# Apparently, precipitation type is not reported any more from 2020 on => what to do?
# DECISION TO TAKE: either drop these variables, or subset to the years in which they are available (the first would be more reasonable)
# => NOTE: The final drop will be performed later
# Plausibility checks again: mean of Bikes per weekdays
aggregate(weather_data_final, Bikes_Counted ~ Weekday, mean)
## Weekday Bikes_Counted
## 1 Fri 9.508385
## 2 Mon 9.750259
## 3 Sat 13.292817
## 4 Sun 20.246789
## 5 Thu 10.163430
## 6 Tue 9.612277
## 7 Wed 10.388600
plot(aggregate(weather_data_final, air_temp ~ lubridate::month(weather_data_final$Date_Hour), mean))
# Check whether there is a general difference in bike traffic intensity across stations
aggregate(weather_data_final, Bikes_Counted ~ Zählstelle, mean)
## Zählstelle Bikes_Counted
## 1 Dormagen 4.886107
## 2 Grevenbroich 5.515678
## 3 Jüchen 8.847134
## 4 Meerbusch 24.364567
## 5 Neuss 17.045638
# Check for perfect collinearity -> we need to avoid this
weather_data_final |> dplyr::select(where(is.numeric)) |> na.omit() |> cor() |>
heatmap(Rowv = NA, Colv = NA)
# Aggregate all stations (check for differing availability, timespan when all are available)
# Instead of running 5 different models or some station dummy, we will predict more on
# the overall bike traffic in this district.
# Therefore, we aggregate ALL stations for the point in time when ALL are available,
# such that there are no skewed sums due to structural unavailabilities.
mins <- c()
for(station in unique(weather_data_final$Zählstelle)) {
newmin <- min((weather_data_final |> subset(Zählstelle == station))$Date_Hour)
mins <- mins |> c(newmin)
}
# Find the station that starts latest (note: c() dropped the POSIXct class, hence the numeric print)
max(mins)
## [1] 1453759200
maxs <- c()
for(station in unique(weather_data_final$Zählstelle)) {
newmax <- max((weather_data_final |> subset(Zählstelle == station))$Date_Hour)
maxs <- maxs |> c(newmax)
}
# Find the earliest drop-out station
min(maxs)
## [1] 1673730000
# Check how many stations available at every point in time
table_availability <- aggregate(weather_data_final, Zählstelle ~ Date_Hour, FUN = function (x) length(unique(x)))
# Find the Hours and Dates at which ALL STATIONS are available!
allstationsavailable_time_hour <- table_availability$Date_Hour[which(table_availability$Zählstelle == max(table_availability$Zählstelle))]
# Earliest and latest common date
allstationsavailable_time_hour[1]; allstationsavailable_time_hour[length(allstationsavailable_time_hour)]
## [1] "2016-01-25 23:00:00 CET"
## [1] "2023-01-14 22:00:00 CET"
# ONLY keep dates where all are available!
weather_data_final <- weather_data_final[which(weather_data_final$Date_Hour %in% allstationsavailable_time_hour),]
# roughly 50,000 obs lost... But still more than enough :)
# NOW: split the data up again into weather data and bike observations, and
# re-merge them later (the decision to sum up all stations was only taken at this
# point, and we did not want to restructure the code from the beginning, risking
# new code problems after such a change)
bikes_only <- weather_data_final[, c("Date_Hour", "Zählstelle", "Bikes_Counted")]
weather_only <- weather_data_final[, c("Date_Hour", setdiff(colnames(weather_data_final), colnames(bikes_only)))]
# Aggregate the bike traffic from all stations for every hour of common availability
bikes_aggr <- aggregate(bikes_only, Bikes_Counted ~ Date_Hour, sum)
# Drop duplicates -> every hour needs to be documented only once
# All hourly data for one particular hour are redundant anyway
weather_only <- unique(weather_only)
# As expected, keeping only the bike data and deleting the duplicate rows
# in the weather data lead to the same number of rows:
# ONE row for the weather at each date/hour combination,
# and ONE row for the aggregated bike traffic at each.
nrow(bikes_aggr) == nrow(weather_only)
## [1] TRUE
# We can now neatly bring it together.
data_final_aggr <- merge(bikes_aggr, weather_only, by = "Date_Hour")
# Decision on deletion of variables with little availability
length(which(data_final_aggr$soilstate_7am == -9)) / nrow(data_final_aggr)
## [1] 0.3851679
length(which(data_final_aggr$type_precip == -9)) / nrow(data_final_aggr)
## [1] 0.3851679
# Very high unavailability... This is a problem.
# Do these variables explain anything, looked at all alone?
summary(lm(data_final_aggr,
formula = Bikes_Counted ~ as.factor(soilstate_7am) + as.factor(type_precip))
)$r.squared
## [1] 0.04306952
# Either delete 38% of data or two variables which jointly explain 4% of variation?
# More reasonable to delete those variables.
data_final_aggr$soilstate_7am <- NULL
data_final_aggr$type_precip <- NULL
# How many NAs are left? Final check...
nrow(data_final_aggr) - nrow(na.omit(data_final_aggr))
## [1] 2800
sum(na.omit(data_final_aggr) == -9)
## [1] 0
# Only 2,800 rows will be deleted + no coded NAs (-9) any more.
# This is good! Now we can simply delete the NA rows and obtain an almost
# fully available dataset.
data_final_aggr <- data_final_aggr |> na.omit()
# Delete useless info for final data set:
colnames(data_final_aggr)
## [1] "Date_Hour" "Bikes_Counted" "visibility"
## [4] "air_pressure" "rel_humidity" "air_temp"
## [7] "sunshine_duration" "precip_h_mm" "precip_type"
## [10] "wind_highest_last_h" "wind_speed" "wind_degree"
## [13] "cloud_cover" "Date" "Hour"
## [16] "Weekday" "sunshine_dur_day" "snow_cover_cm"
## [19] "Holiday" "School_holidays"
# Can be re-constructed if needed later
data_final_aggr$Date <- NULL
# ====================================================================
# Reformatting
# ====================================================================
# Now, we want to change the variables to their most common/understandable
# units and assign them their most adequate type.
#class(data_final_aggr$Holiday)
# Visibility in km instead of m
data_final_aggr$visibility <- data_final_aggr$visibility / 1000
unique(data_final_aggr$precip_type)
## [1] 0 6 8 7 2 4
# Wind direction -- 8-fold separation instead of degree as numeric
# We do not want 36 different dummies either
unique(data_final_aggr$wind_degree) |> sort()
## [1] 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180
## [20] 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360
wind_degr_dir <- data.frame(degr = seq(0, 360, by = 10))
wind_degr_dir$dir <- c(rep("N", 3),
rep("NE", 4),
rep("E", 5),
rep("SE", 4),
rep("S", 5),
rep("SW", 4),
rep("W", 5),
rep("NW", 4),
rep("N", 3))
# Assigns the right verbal description (e.g. NE = North-East and so on)
winddir_assign <- array(wind_degr_dir$dir, dimnames = list(wind_degr_dir$degr))
data_final_aggr$wind_dir <- winddir_assign[as.character(data_final_aggr$wind_degree)]
# check if assignment successful:
unique(data_final_aggr[, c("wind_dir", "wind_degree")])
## wind_dir wind_degree
## 1 S 180
## 3 SW 220
## 4 SW 240
## 5 SW 210
## 12 S 200
## 14 S 190
## 16 SW 230
## 53 W 280
## 54 NW 320
## 55 W 260
## 57 W 290
## 58 W 250
## 234 NW 300
## 235 NW 330
## 236 N 340
## 237 N 20
## 238 N 360
## 245 S 170
## 247 SE 140
## 248 SE 150
## 251 S 160
## 350 SE 130
## 356 W 270
## 425 SE 120
## 428 E 80
## 429 NE 60
## 433 NE 50
## 435 N 0
## 441 E 110
## 442 E 70
## 451 E 100
## 456 NE 40
## 484 N 350
## 495 N 10
## 518 NE 30
## 526 E 90
## 548 NW 310
# Worked well
data_final_aggr$wind_degree <- NULL
data_final_aggr$wind_dir <- data_final_aggr$wind_dir |> as.factor()
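As a side note, the 37-row lookup table above can also be reproduced with a compact arithmetic binning. This is only a sketch (the helper name `deg_to_dir` is hypothetical, not part of the script); each compass direction owns a 45-degree sector centred on it, so shifting by 22.5 degrees and flooring picks the sector index:

```r
# Sketch: arithmetic alternative to the lookup-table assignment above
deg_to_dir <- function(deg) {
  dirs <- c("N", "NE", "E", "SE", "S", "SW", "W", "NW")
  # shift by half a sector (22.5 deg), wrap at 360, then index the 45-deg sectors
  dirs[floor(((deg + 22.5) %% 360) / 45) + 1]
}
deg_to_dir(c(0, 60, 70, 180, 330, 360))  # "N" "NE" "E" "S" "NW" "N"
```

This yields the same mapping as the table for all multiples of 10 degrees; the explicit lookup table is arguably more readable, which is why the script keeps it.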
# precip_type
class(data_final_aggr$precip_type) # needs to be treated as factor
## [1] "integer"
data_final_aggr$precip_type <- data_final_aggr$precip_type |> as.factor()
# cloud_cover
class(data_final_aggr$cloud_cover) # makes more sense in dummy interpretation
## [1] "integer"
data_final_aggr$cloud_cover <- data_final_aggr$cloud_cover |> as.factor()
# hour dummy
class(data_final_aggr$Hour) # also more sense as factor, because no linear trend throughout day
## [1] "numeric"
data_final_aggr$Hour <- as.factor(data_final_aggr$Hour)
# weekday
class(data_final_aggr$Weekday) # transform to factor
## [1] "character"
data_final_aggr$Weekday <- data_final_aggr$Weekday |> as.factor()
# make holiday dummies factors as well
data_final_aggr$Holiday <- data_final_aggr$Holiday |> as.factor()
data_final_aggr$School_holidays <- data_final_aggr$School_holidays |> as.factor()
# Two times sun cover: once on hourly base, and on daily base. check whether identical...
# aggregate(data_final_aggr, sunshine_dur_day ~ substr(Date_Hour, 1, 10), FUN = mean)[1:20]
# aggregate(data_final_aggr, sunshine_duration ~ substr(Date_Hour, 1, 10), FUN = sum)[1:20]
c(min(data_final_aggr$sunshine_dur_day), max(data_final_aggr$sunshine_dur_day))
## [1] 0 160
unique(data_final_aggr$sunshine_dur_day) |> head(50)
## [1] 0 33 50 14 41 39 47 18 40 53 11 26 1 36 84 3 23 16 21
## [20] 30 74 59 4 19 86 10 48 72 2 90 101 114 58 5 12 8 88 81
## [39] 89 122 87 35 85 71 42 96 34 7 44 54
# No point in keeping the second variable: delete it
data_final_aggr$sunshine_dur_day <- NULL
# Plausibility analysis for the other sunshine duration
sunsh_per_day <- aggregate(sunshine_duration ~ substr(Date_Hour, 1, 10),
                           data = data_final_aggr, FUN = sum)
plot(1:366,
     aggregate(sunshine_duration ~ substr(`substr(Date_Hour, 1, 10)`, 6, 10),
               data = sunsh_per_day, FUN = max)[, 2] / 60,
     main = "Plausibility check: Sunshine duration per cal. date")
# YES, naturally limited seasonality pattern recognizable.
# Inspect precip_type, because 2 is not specified in online source
# cf. https://www.dwd.de/DE/leistungen/klimadatendeutschland/beschreibung_tagesmonatswerte.html
# Tabelle NIEDERSCHLAGSHOEHE_IND (30 May 2025)
plot(aggregate(precip_h_mm ~ precip_type, data = data_final_aggr, FUN = mean))
table(data_final_aggr$precip_type) # code 2 unimportant, only 6 occ.
##
## 0 2 4 6 7 8
## 42852 6 58 11168 197 281
# Check whether types are as needed:
sapply(data_final_aggr, class)
## $Date_Hour
## [1] "POSIXct" "POSIXt"
##
## $Bikes_Counted
## [1] "integer"
##
## $visibility
## [1] "numeric"
##
## $air_pressure
## [1] "numeric"
##
## $rel_humidity
## [1] "numeric"
##
## $air_temp
## [1] "numeric"
##
## $sunshine_duration
## [1] "numeric"
##
## $precip_h_mm
## [1] "numeric"
##
## $precip_type
## [1] "factor"
##
## $wind_highest_last_h
## [1] "numeric"
##
## $wind_speed
## [1] "numeric"
##
## $cloud_cover
## [1] "factor"
##
## $Hour
## [1] "factor"
##
## $Weekday
## [1] "factor"
##
## $snow_cover_cm
## [1] "numeric"
##
## $Holiday
## [1] "factor"
##
## $School_holidays
## [1] "factor"
##
## $wind_dir
## [1] "factor"
# Cloud cover still has a negative value... is this an NA code?
unique(data_final_aggr$cloud_cover) |> sort() |> head(100)
## [1] -1 0 1 2 3 4 5 6 7 8
## Levels: -1 0 1 2 3 4 5 6 7 8
length(which(data_final_aggr$cloud_cover == -1))
## [1] 43
# Cloud cover: treat -1 as NA and then drop those rows from the df.
data_final_aggr$cloud_cover[which(data_final_aggr$cloud_cover == -1)] <- NA
data_final_aggr <- data_final_aggr |> na.omit()
# Create dummies for months as well
data_final_aggr$Month <- as.factor(substr(data_final_aggr$Date_Hour, 6, 7))
# Recall once more: not the entire period has data available!
# This means there will be gaps in the model's predictions.
plot(table_availability$Date_Hour, table_availability$Zählstelle,# type = 'l',
col = ifelse(table_availability$Zählstelle == 5, 'darkgreen', 'grey'),
main = "Data availability", sub = "For green dots, all 5 stations reported")
# FINAL
format(object.size(data_final_aggr), units = "MB") # check for size
## [1] "6.3 Mb"
saveRDS(data_final_aggr, file = paste(wd, "data/final/data_final_bike_weather.RDS", sep = "/"))
# CLEAR environment to continue with only the relevant data set
rm(list = setdiff(ls(), "data_final_aggr"))
# } #end of if process raw data.
A test split is created here; it is used for comparison metrics at the end of every model section.
df <- data_final_aggr
names(df) # Print variables
## [1] "Date_Hour" "Bikes_Counted" "visibility"
## [4] "air_pressure" "rel_humidity" "air_temp"
## [7] "sunshine_duration" "precip_h_mm" "precip_type"
## [10] "wind_highest_last_h" "wind_speed" "cloud_cover"
## [13] "Hour" "Weekday" "snow_cover_cm"
## [16] "Holiday" "School_holidays" "wind_dir"
## [19] "Month"
###################################################
# A part of the data was removed during testing to limit run time
df <- df[sample(nrow(df), size = floor(0.100 * nrow(df))), ] # influences model run time a lot
# Set the fraction to 1.00 (100% of the data) for the final rendering
###################################################
set.seed(123) # for reproducibility
n <- nrow(df)
train_index <- sample(seq_len(n), size = 0.8 * n)
train <- df[train_index, ]
test <- df[-train_index, ]
First lm analysis: a model on all variables in the data set, except for Date_Hour.
lm_exp <- lm(Bikes_Counted ~ . - Date_Hour, data = data_final_aggr)
lm_exp |> summary()
##
## Call:
## lm(formula = Bikes_Counted ~ . - Date_Hour, data = data_final_aggr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -265.92 -36.16 -2.93 24.98 1102.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -398.36544 41.42035 -9.618 < 2e-16 ***
## visibility -0.16207 0.02418 -6.702 2.08e-11 ***
## air_pressure 0.43533 0.03971 10.962 < 2e-16 ***
## rel_humidity -0.78779 0.03642 -21.630 < 2e-16 ***
## air_temp 3.89467 0.09595 40.591 < 2e-16 ***
## sunshine_duration 1.36000 0.02335 58.254 < 2e-16 ***
## precip_h_mm -1.12857 0.66552 -1.696 0.089935 .
## precip_type2 -42.96897 36.56421 -1.175 0.239934
## precip_type4 -17.01243 9.63589 -1.766 0.077481 .
## precip_type6 -11.92730 0.99994 -11.928 < 2e-16 ***
## precip_type7 10.69082 5.34677 1.999 0.045560 *
## precip_type8 5.10923 4.49564 1.136 0.255758
## wind_highest_last_h 0.55203 0.26502 2.083 0.037260 *
## wind_speed -4.68807 0.40310 -11.630 < 2e-16 ***
## cloud_cover1 -4.07176 1.52886 -2.663 0.007741 **
## cloud_cover2 1.55867 1.70823 0.912 0.361536
## cloud_cover3 -3.56006 1.76602 -2.016 0.043819 *
## cloud_cover4 -5.10305 1.83424 -2.782 0.005403 **
## cloud_cover5 -2.59075 1.78590 -1.451 0.146877
## cloud_cover6 -1.46199 1.64113 -0.891 0.373018
## cloud_cover7 -3.68954 1.38363 -2.667 0.007665 **
## cloud_cover8 -2.75783 1.48879 -1.852 0.063975 .
## Hour2 2.27745 2.12311 1.073 0.283414
## Hour3 6.23293 2.12382 2.935 0.003339 **
## Hour4 17.26044 2.12648 8.117 4.88e-16 ***
## Hour5 28.76377 2.13208 13.491 < 2e-16 ***
## Hour6 26.27472 2.14530 12.248 < 2e-16 ***
## Hour7 20.91949 2.16238 9.674 < 2e-16 ***
## Hour8 33.27828 2.18028 15.263 < 2e-16 ***
## Hour9 55.42019 2.20487 25.135 < 2e-16 ***
## Hour10 69.31265 2.22275 31.183 < 2e-16 ***
## Hour11 75.93243 2.23874 33.917 < 2e-16 ***
## Hour12 87.43285 2.25154 38.832 < 2e-16 ***
## Hour13 94.30070 2.26033 41.720 < 2e-16 ***
## Hour14 90.20336 2.26427 39.838 < 2e-16 ***
## Hour15 76.85222 2.25732 34.046 < 2e-16 ***
## Hour16 51.40162 2.23939 22.953 < 2e-16 ***
## Hour17 22.92969 2.21399 10.357 < 2e-16 ***
## Hour18 0.15166 2.18812 0.069 0.944744
## Hour19 -9.25724 2.16377 -4.278 1.89e-05 ***
## Hour20 -7.70772 2.14761 -3.589 0.000332 ***
## Hour21 -6.44862 2.13384 -3.022 0.002512 **
## Hour22 -4.95042 2.12671 -2.328 0.019930 *
## Hour23 -3.15770 2.12332 -1.487 0.136980
## WeekdayMon -0.01278 1.17585 -0.011 0.991326
## WeekdaySat 21.65892 1.17080 18.499 < 2e-16 ***
## WeekdaySun 60.16976 1.17372 51.264 < 2e-16 ***
## WeekdayThu 3.33215 1.17152 2.844 0.004453 **
## WeekdayTue 0.57126 1.17088 0.488 0.625629
## WeekdayWed 3.11504 1.17357 2.654 0.007949 **
## snow_cover_cm 0.01569 0.62413 0.025 0.979940
## Holiday1 59.93505 2.75290 21.772 < 2e-16 ***
## School_holidays1 -0.95914 0.92446 -1.038 0.299500
## wind_dirN -0.56714 1.59339 -0.356 0.721893
## wind_dirNE -5.44559 1.60769 -3.387 0.000707 ***
## wind_dirNW -3.63588 1.74741 -2.081 0.037463 *
## wind_dirS 0.89384 1.54368 0.579 0.562568
## wind_dirSE -3.38121 1.59239 -2.123 0.033728 *
## wind_dirSW 3.43129 1.54751 2.217 0.026607 *
## wind_dirW -1.84920 1.58038 -1.170 0.241967
## Month02 -5.00103 1.77048 -2.825 0.004735 **
## Month03 1.42147 1.76866 0.804 0.421574
## Month04 7.67926 1.81994 4.220 2.45e-05 ***
## Month05 -1.06303 1.93444 -0.550 0.582647
## Month06 -15.14071 2.14612 -7.055 1.75e-12 ***
## Month07 -12.86941 2.20352 -5.840 5.24e-09 ***
## Month08 -10.58036 2.19436 -4.822 1.43e-06 ***
## Month09 -4.41726 2.01266 -2.195 0.028187 *
## Month10 -0.36999 1.82601 -0.203 0.839431
## Month11 2.43963 1.72678 1.413 0.157714
## Month12 1.72245 1.72095 1.001 0.316894
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.05 on 54448 degrees of freedom
## Multiple R-squared: 0.5599, Adjusted R-squared: 0.5594
## F-statistic: 989.7 on 70 and 54448 DF, p-value: < 2.2e-16
lm_exp_pred <- data.frame(Date = data_final_aggr$Date_Hour,
Actual = data_final_aggr$Bikes_Counted,
Pred = predict(lm_exp, data_final_aggr))
# Note that negative predictions are possible
min(lm_exp_pred$Pred)
## [1] -103.7233
# any(grepl("2021-05", data_final_aggr$Date_Hour))
# data_final_aggr[which(grepl("2021-05", data_final_aggr$Date_Hour)),]
library(ggplot2)
# Get some samples for plotting the model predictions
set.seed(06124); samples <- sample(1:(nrow(lm_exp_pred) - 150), 20)
for (i in 1:length(samples)) {
plot <- ggplot(data = lm_exp_pred[samples[i]:(samples[i] + 150),], aes(x = Date, y = Actual), alpha = 0.7) +
geom_line() +
geom_line(aes(y = Pred), color = "red") +
ggtitle("Examples: Actual vs. prediction")
print(plot)
}
# Peaks are not so well identified
# Overprediction in winter!
ggplot(data = lm_exp_pred, aes(x = Date, y = Actual), alpha = 0.7) + geom_line() +
ggtitle("Actual bike traffic through the time")
# Periods with data:
# Note that when not all stations are available, then these dates are not in the sample
# due to our decisions during aggregation. Only points in time that were green in the Data availability
# plot are included!
# Negative values still possible.
# See how fit increases when negative values are all set to zero:
lm_exp_pred$Pred[which(lm_exp_pred$Pred < 0)] <- 0
new_rsq <- 1 - sum((lm_exp_pred$Pred - lm_exp_pred$Actual)**2) /
  sum((lm_exp_pred$Actual - mean(lm_exp_pred$Actual))**2)
print(new_rsq) # only slightly better than before
## [1] 0.5793955
Now it is time for feature engineering: thinking about possible interaction terms, squares, logs, etc.
# Find out dates with highest mispredictions to find out what might be relevant
lm_exp_pred$eps_sq <- (lm_exp_pred$Actual - lm_exp_pred$Pred) ** 2
lm_exp_pred$eps <- (lm_exp_pred$Actual - lm_exp_pred$Pred)
lm_exp_pred$abs_error <- abs(lm_exp_pred$Actual - lm_exp_pred$Pred)
pred_error_tbl <- aggregate(eps ~ substr(Date, 6, 10), data = lm_exp_pred, FUN = mean)
pred_error_tbl$`substr(Date, 6, 10)` <- as.Date(pred_error_tbl$`substr(Date, 6, 10)`, format = "%m-%d")
colnames(pred_error_tbl)[1] <- "Date"
plot(pred_error_tbl$Date, pred_error_tbl$eps, main = "Mean eps", sub = "Attention! Absolute eps can be misleading")
No clear seasonal misprediction pattern is visible.
To find interesting combinations, I first run an lm of
Bikes_Counted on each possible pair of two variables
separately, and then add their interaction term.
When the R² of the model with the interaction term is considerably higher,
this indicates an important interaction.
# Create interaction terms
colnames(data_final_aggr)
## [1] "Date_Hour" "Bikes_Counted" "visibility"
## [4] "air_pressure" "rel_humidity" "air_temp"
## [7] "sunshine_duration" "precip_h_mm" "precip_type"
## [10] "wind_highest_last_h" "wind_speed" "cloud_cover"
## [13] "Hour" "Weekday" "snow_cover_cm"
## [16] "Holiday" "School_holidays" "wind_dir"
## [19] "Month"
# Which interaction terms have high explanatory power, all alone?
dummy_results <- matrix(NA, ncol(data_final_aggr) - 1, ncol(data_final_aggr) - 1)
# Create variable combinations to loop through
colnames(dummy_results) <- colnames(data_final_aggr[-which(colnames(data_final_aggr) == "Bikes_Counted")])
rownames(dummy_results) <- colnames(data_final_aggr[-which(colnames(data_final_aggr) == "Bikes_Counted")])
for(var1 in rownames(dummy_results)) {
for(var2 in colnames(dummy_results)) {
if(var1 != var2) { # only for 2 different variables, exclude diagonal
# Look at increase of R² w.r.t. model without interaction term but only both vars
# This helps to evaluate how much additional power comes from the interaction term
formula1 <- as.formula(paste("Bikes_Counted ~ ", var1, " + ", var2, sep = ""))
formula2 <- as.formula(paste("Bikes_Counted ~ ", var1, " + ", var2, " + ", var1, ":", var2, sep = ""))
rsq_1 <- summary(lm(data = data_final_aggr, formula = formula1))$r.squared
rsq_2 <- summary(lm(data = data_final_aggr, formula = formula2))$r.squared
dummy_results[var1, var2] <- rsq_2 - rsq_1
} else {
dummy_results[var1, var2] <- NA
}
}
}
# Inspect
View(as.data.frame(dummy_results))
heatmap(dummy_results, Rowv = NA, Colv = "Rowv")
# Store in format that is easier to use
res_flat <- as.vector(dummy_results)
# Get the names of the possible combinations
names(res_flat) <- apply(expand.grid(rownames(dummy_results), colnames(dummy_results)), 1, paste, collapse = ":")
# Keep only every second entry (rows 1, 3, ...) to halve the vector.
# Note: this does not fully remove the symmetric duplicates -- some pairs
# still appear twice in the sorted list below.
res_flat <- res_flat[seq(1, length(res_flat), 2)]
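Since the matrix is symmetric, each unordered pair could alternatively be extracted exactly once via `upper.tri()`. A small self-contained sketch on a toy 3x3 matrix (not part of the script) illustrates the idea:

```r
# Toy sketch: extract each unordered pair exactly once from a symmetric matrix
m <- matrix(c(NA, 2, 3,
              2, NA, 4,
              3, 4, NA),
            nrow = 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
idx <- which(upper.tri(m), arr.ind = TRUE)   # column-major: (1,2), (1,3), (2,3)
res <- m[upper.tri(m)]
names(res) <- paste(rownames(m)[idx[, 1]], colnames(m)[idx[, 2]], sep = ":")
res   # a:b = 2, a:c = 3, b:c = 4 -- no symmetric duplicates
```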
# Find 50 highest increases of R²
sort(res_flat, decreasing = TRUE)[1:50]
## Weekday:Hour Weekday:sunshine_duration
## 0.088786223 0.068515676
## cloud_cover:Hour air_temp:Hour
## 0.066838099 0.064427108
## Weekday:rel_humidity air_temp:Weekday
## 0.046916448 0.036676170
## Weekday:air_temp air_temp:Month
## 0.036676170 0.030358755
## wind_dir:wind_highest_last_h wind_highest_last_h:wind_dir
## 0.025438644 0.025438644
## wind_highest_last_h:cloud_cover cloud_cover:wind_highest_last_h
## 0.023276072 0.023276072
## wind_dir:wind_speed air_temp:rel_humidity
## 0.019849925 0.017111432
## air_pressure:wind_highest_last_h wind_highest_last_h:air_pressure
## 0.015783914 0.015783914
## wind_highest_last_h:wind_speed Weekday:Month
## 0.015275988 0.014967733
## wind_highest_last_h:Month wind_dir:Hour
## 0.013852503 0.013808261
## cloud_cover:Month cloud_cover:Weekday
## 0.012016789 0.011852918
## Weekday:cloud_cover cloud_cover:wind_speed
## 0.011852918 0.011848038
## air_temp:sunshine_duration air_pressure:air_temp
## 0.011825016 0.011553103
## air_temp:air_pressure air_pressure:Hour
## 0.011553103 0.011351883
## cloud_cover:air_temp air_temp:cloud_cover
## 0.010792805 0.010792805
## cloud_cover:rel_humidity wind_highest_last_h:air_temp
## 0.010113877 0.009482674
## air_temp:wind_highest_last_h air_temp:visibility
## 0.009482674 0.008897519
## wind_dir:Month Holiday:Hour
## 0.008825993 0.008799381
## air_pressure:wind_speed wind_highest_last_h:Hour
## 0.008365509 0.008121863
## precip_h_mm:Hour air_temp:precip_type
## 0.007904556 0.007693044
## Holiday:sunshine_duration wind_highest_last_h:precip_type
## 0.007276480 0.007199604
## Weekday:precip_type air_pressure:Month
## 0.006594162 0.006131581
## air_temp:Holiday Holiday:air_temp
## 0.005639090 0.005639090
## Weekday:visibility Holiday:rel_humidity
## 0.005263972 0.005135211
## wind_dir:cloud_cover cloud_cover:wind_dir
## 0.004936417 0.004936417
# Only use the 20 best (any symmetric duplicates among them are harmless:
# R treats var1:var2 and var2:var1 as the same term and drops the repeat)
twenty_best_interactions <- sort(res_flat, decreasing = TRUE)[1:20] |> names() |>
paste(collapse = " + ")
lm_exp$call
## lm(formula = Bikes_Counted ~ . - Date_Hour, data = data_final_aggr)
# Prepare formula
#paste(twenty_best_interactions, collapse = " + ")
lm_exp_plus_interaction_formula = paste("Bikes_Counted ~ . + ", twenty_best_interactions, " - Date_Hour", sep = "") |> as.formula()
# Train model
lm_exp_plus_interaction <- lm(data = data_final_aggr,
formula = lm_exp_plus_interaction_formula)
summary(lm_exp_plus_interaction)$r.squared - summary(lm_exp)$r.squared
## [1] 0.1845769
# Strong increase! -> roughly 20 pct. points with only 20 terms
In the following, it might be interesting to look at other functional forms, in our case quadratics and cubics. This is done by running three models of Bikes_Counted on each explanatory variable separately: the baseline, one adding the square, and one adding both the square and the cube.
Then, again, I check whether this brings substantial improvements w.r.t. the baseline.
# Other functional forms: squared?
single_var_sq_results <- matrix(NA, nrow = nrow(dummy_results), ncol = 2)
colnames(single_var_sq_results) <- c("Squares added: Diff. in R²", "Cubics added: Diff. in R²")
rownames(single_var_sq_results) <- rownames(dummy_results)
for(k in rownames(single_var_sq_results)) {
if(is.numeric(data_final_aggr[[k]])){
f1 <- as.formula(paste("Bikes_Counted ~ ", k, sep = ""))
f2 <- as.formula(paste("Bikes_Counted ~ ", k, " + I(", k, "**2)", sep = ""))
f3 <- as.formula(paste("Bikes_Counted ~ ", k, " + I(", k, "**2)", " + I(", k, "**3)", sep = ""))
baseline_rsq <- summary(lm(data = data_final_aggr, formula = f1))$r.squared
squares_rsq <- summary(lm(data = data_final_aggr, formula = f2))$r.squared
cubics_rsq <- summary(lm(data = data_final_aggr, formula = f3))$r.squared
# Increases in R² thanks to squares / cubics
single_var_sq_results[k, 1] <- squares_rsq - baseline_rsq
single_var_sq_results[k, 2] <- cubics_rsq - baseline_rsq
}
}
single_var_sq_results <- single_var_sq_results |> na.omit()
data_mat <- as.matrix(single_var_sq_results[order(single_var_sq_results[,1], decreasing = TRUE),])
rownames_vec <- rownames(data_mat)
cols <- c("blue", "red")
# Plot these improvements in R²
{
matplot(data_mat, pch = 19, col = cols, xaxt = "n",
xlab = "Variable", ylab = "ΔR²", main = "R² - Increase through transformations")
axis(1, at = 1:nrow(data_mat), labels = rownames_vec, las = 2, cex.axis = 0.5)
legend("topright", legend = colnames(data_mat), col = cols, pch = 19)
}
# useful:
# wind_highest_last_h as **2, air_temp as **2 and **3, rel_humidity as **2, air_pressure as **2
added_functional_forms <- paste("I(wind_highest_last_h ** 2) + I(air_temp ** 3) + I(air_temp ** 2) + I(rel_humidity ** 2) + I(air_pressure ** 2)")
# Prepare formula as object, can be used later
lm_exp_plus_interaction_functforms_formula <- paste(
paste(deparse(lm_exp_plus_interaction_formula), collapse = ""), " + ",
added_functional_forms, sep = "") |> as.formula()
lm_exp_plus_interaction_functforms <- lm(data = data_final_aggr,
formula = lm_exp_plus_interaction_functforms_formula)
summary(lm_exp_plus_interaction_functforms)$r.squared - summary(lm_exp)$r.squared
## [1] 0.1972999
# Strong increase w.r.t. the baseline -- but how much comes from the functional forms alone?
summary(lm_exp_plus_interaction_functforms)$r.squared - summary(lm_exp_plus_interaction)$r.squared
## [1] 0.01272301
# Only a minimal further increase in fit, given that these are 5 additional variables
Let’s quickly visualize the predictions:
predpower_comparison <- data.frame(
Date_Time = data_final_aggr$Date_Hour,
Actual = data_final_aggr$Bikes_Counted,
`LM_Baseline` = predict(lm_exp, data_final_aggr),
`LM_Interaction_terms` = predict(lm_exp_plus_interaction, data_final_aggr),
`LM_Interaction_terms_FunctForms` = predict(lm_exp_plus_interaction_functforms, data_final_aggr)
)
# We want no negative predictions!
predpower_comparison[predpower_comparison < 0] <- 0
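The whole-data-frame comparison above works here because Actual is never negative and the POSIXct dates all lie after 1970, so only the prediction columns are affected. A more targeted variant (toy data, with hypothetical column names mirroring the frame above) clips only the prediction columns:

```r
# Toy sketch: clip only the prediction columns, leaving Actual untouched
toy <- data.frame(Actual               = c(5, 0, 12),
                  LM_Baseline          = c(-3, 4, 10),
                  LM_Interaction_terms = c(2, -1, 11))
pred_cols <- setdiff(names(toy), "Actual")
toy[pred_cols] <- lapply(toy[pred_cols], pmax, 0)  # pmax(x, 0) floors at zero
toy
```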
prettypal <- c('Actual' = 'darkgrey', 'LM Baseline' = "#e41a1c", 'LM + Interact.' = "#377eb8", 'LM + Interact. + Fct. Forms' = "#4daf4a")#, "#984ea3", "#ff7f00")
ggplot(data = predpower_comparison, aes(x = Date_Time)) +
geom_line(aes(y = Actual, color = "Actual")) +
geom_line(aes(y = LM_Baseline, color = 'LM Baseline')) +
geom_line(aes(y = LM_Interaction_terms, color = 'LM + Interact.')) +
geom_line(aes(y = LM_Interaction_terms_FunctForms, color = 'LM + Interact. + Fct. Forms')) +
scale_color_manual(values = prettypal) +
ggtitle("Models vs. actual values")
It is certainly more informative to sample some shorter periods:
n_sample_periods <- 20
length_sample_periods <- 100
for(i in 1:n_sample_periods) {
subsample_startrow <- sample(1:(nrow(predpower_comparison) - length_sample_periods), 1)
subsample_endrow <- subsample_startrow + length_sample_periods
plotsample <- predpower_comparison[subsample_startrow:subsample_endrow,]
plot <- ggplot(data = plotsample, aes(x = Date_Time)) +
geom_line(aes(y = Actual, color = "Actual")) +
geom_line(aes(y = LM_Baseline, color = 'LM Baseline')) +
geom_line(aes(y = LM_Interaction_terms, color = 'LM + Interact.')) +
geom_line(aes(y = LM_Interaction_terms_FunctForms, color = 'LM + Interact. + Fct. Forms')) +
scale_color_manual(values = prettypal) +
ggtitle("Example periods: Models vs. actual value")
print(plot)
}
Is the partly very good explanatory power due to over-fitting, or would it also hold on random testing data? (I am optimistic, because lm is not very prone to overfitting when \(n_{\mathrm{obs}} \gg n_{\mathrm{vars}}\).) To check this, I perform a hold-out cross-validation with an 80-20 training-testing split.
# Prepare function to get rid of negative predictions
no_negative_predictions <- function(x) {x[which(x < 0)] <- 0; return(x)}
# Prepare functions to get metrics quickly
calc_rsq <- function(actual, predicted) {
SSR <- sum((actual - predicted) ** 2)
SST <- (sum((actual - mean(actual)) ** 2))
rsq <- 1 - SSR / SST
return(rsq)
}
calc_mae <- function(actual, predicted) {
mae <- mean(abs(actual - predicted))
return(mae)
}
calc_rmse <- function(actual, predicted) {
rmse <- sqrt(mean((actual - predicted) ** 2))
return(rmse)
}
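A quick sanity check of these helpers on a tiny made-up example (the definitions are restated so the snippet runs on its own):

```r
# Restated helpers from above, plus a tiny worked example
calc_rsq  <- function(actual, predicted) {
  1 - sum((actual - predicted) ** 2) / sum((actual - mean(actual)) ** 2)
}
calc_mae  <- function(actual, predicted) mean(abs(actual - predicted))
calc_rmse <- function(actual, predicted) sqrt(mean((actual - predicted) ** 2))
actual <- c(0, 10, 20)
pred   <- c(5, 10, 14)
calc_mae(actual, pred)   # (5 + 0 + 6) / 3 = 3.667
calc_rmse(actual, pred)  # sqrt((25 + 0 + 36) / 3) = 4.509; RMSE >= MAE always
calc_rsq(actual, pred)   # 1 - 61 / 200 = 0.695
```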
Note on the abbreviations:
- LM: baseline lm
- LM + IA: LM + the 20 most promising interaction terms
- LM + IA + FF: LM + IA + the 5 most promising functional forms
set.seed(0345)
iter_cv <- 5
cv_res_matrix <- matrix(NA, nrow = iter_cv, ncol = (3 * 2 * 3)) # 3 models - 2 splits - 3 metrics
models <- c("LM", "LM + IA", "LM + IA + FF")
splits <- c("Train", "Test")
metrics <- c("RMSE", "MAE", "R2")
colnames(cv_res_matrix) <- paste(rep(models, each = 6),
rep(rep(splits, each = 3), 3),
rep(metrics, 6),
sep = ", ")
# Store cross-validation RMSE
for(j in 1:iter_cv) {
# Hold-out CV: always get 80% for training and 20% for testing
train_sample <- sample(1:nrow(data_final_aggr), floor(0.8 * nrow(data_final_aggr)))
data_train <- data_final_aggr[train_sample,]
data_test <- data_final_aggr[- train_sample,]
# Train data and find performance on training set
lm0 <- lm(formula = Bikes_Counted ~ . - Date_Hour, data = data_train)
lm1 <- lm(formula = lm_exp_plus_interaction_formula, data = data_train)
lm2 <- lm(formula = lm_exp_plus_interaction_functforms_formula, data = data_train)
train_pred_0 <- predict(lm0, data_train) |> no_negative_predictions()
train_pred_1 <- predict(lm1, data_train) |> no_negative_predictions()
train_pred_2 <- predict(lm2, data_train) |> no_negative_predictions()
# Predict on testing set as well
test_pred_0 <- predict(lm0, data_test) |> no_negative_predictions()
test_pred_1 <- predict(lm1, data_test) |> no_negative_predictions()
test_pred_2 <- predict(lm2, data_test) |> no_negative_predictions()
cv_res_matrix[j, "LM, Train, RMSE"] <- calc_rmse(data_train$Bikes_Counted, train_pred_0)
cv_res_matrix[j, "LM, Train, MAE"] <- calc_mae(data_train$Bikes_Counted, train_pred_0)
cv_res_matrix[j, "LM, Train, R2"] <- calc_rsq(data_train$Bikes_Counted, train_pred_0)
cv_res_matrix[j, "LM, Test, RMSE"] <- calc_rmse(data_test$Bikes_Counted, test_pred_0)
cv_res_matrix[j, "LM, Test, MAE"] <- calc_mae(data_test$Bikes_Counted, test_pred_0)
cv_res_matrix[j, "LM, Test, R2"] <- calc_rsq(data_test$Bikes_Counted, test_pred_0)
cv_res_matrix[j, "LM + IA, Train, RMSE"] <- calc_rmse(data_train$Bikes_Counted, train_pred_1)
cv_res_matrix[j, "LM + IA, Train, MAE"] <- calc_mae(data_train$Bikes_Counted, train_pred_1)
cv_res_matrix[j, "LM + IA, Train, R2"] <- calc_rsq(data_train$Bikes_Counted, train_pred_1)
cv_res_matrix[j, "LM + IA, Test, RMSE"] <- calc_rmse(data_test$Bikes_Counted, test_pred_1)
cv_res_matrix[j, "LM + IA, Test, MAE"] <- calc_mae(data_test$Bikes_Counted, test_pred_1)
cv_res_matrix[j, "LM + IA, Test, R2"] <- calc_rsq(data_test$Bikes_Counted, test_pred_1)
cv_res_matrix[j, "LM + IA + FF, Train, RMSE"] <- calc_rmse(data_train$Bikes_Counted, train_pred_2)
cv_res_matrix[j, "LM + IA + FF, Train, MAE"] <- calc_mae(data_train$Bikes_Counted, train_pred_2)
cv_res_matrix[j, "LM + IA + FF, Train, R2"] <- calc_rsq(data_train$Bikes_Counted, train_pred_2)
cv_res_matrix[j, "LM + IA + FF, Test, RMSE"] <- calc_rmse(data_test$Bikes_Counted, test_pred_2)
cv_res_matrix[j, "LM + IA + FF, Test, MAE"] <- calc_mae(data_test$Bikes_Counted, test_pred_2)
cv_res_matrix[j, "LM + IA + FF, Test, R2"] <- calc_rsq(data_test$Bikes_Counted, test_pred_2)
}
print(cv_res_matrix |> as.data.frame())
## LM, Train, RMSE LM, Train, MAE LM, Train, R2 LM, Test, RMSE LM, Test, MAE
## 1 71.84773 38.13420 0.5794026 69.52141 37.47217
## 2 71.94497 38.05296 0.5779706 68.89111 37.40996
## 3 71.86519 37.94488 0.5772654 69.40460 37.70370
## 4 72.01739 38.01281 0.5766807 68.81103 37.61346
## 5 71.57650 38.03429 0.5781974 70.53839 37.11994
## LM, Test, R2 LM + IA, Train, RMSE LM + IA, Train, MAE LM + IA, Train, R2
## 1 0.5780675 54.64346 27.71116 0.7567142
## 2 0.5869450 54.92215 27.76031 0.7540556
## 3 0.5875923 55.04246 27.75410 0.7520144
## 4 0.5897026 55.14801 27.76916 0.7517707
## 5 0.5841033 54.75681 27.72497 0.7531434
## LM + IA, Test, RMSE LM + IA, Test, MAE LM + IA, Test, R2
## 1 54.38837 27.77833 0.7417633
## 2 53.18438 27.62686 0.7538217
## 3 52.60181 27.58089 0.7631073
## 4 52.31573 27.56322 0.7628369
## 5 53.85173 27.28904 0.7575994
## LM + IA + FF, Train, RMSE LM + IA + FF, Train, MAE LM + IA + FF, Train, R2
## 1 53.12346 27.00214 0.7700608
## 2 53.38092 27.04676 0.7676654
## 3 53.59656 27.07764 0.7648719
## 4 53.59619 27.03846 0.7655441
## 5 53.23952 27.03315 0.7666345
## LM + IA + FF, Test, RMSE LM + IA + FF, Test, MAE LM + IA + FF, Test, R2
## 1 53.02584 27.09378 0.7545399
## 2 51.86227 27.08503 0.7659091
## 3 50.93540 26.76405 0.7778789
## 4 51.10568 26.95470 0.7736810
## 5 52.46829 26.61718 0.7698938
# cv_res_matrix |> as.data.frame() |> View()
mean_of_all_iters <- apply(cv_res_matrix, 2, mean)
Note: The following section shows some variations of the linear model (time lags, aggregated weather of the previous day); however, the most relevant models are the ones that were just run. These will also be the ones compared to the more advanced models.
The actual performance of the models is better than reported, if you will, because we can physically rule out negative bike traffic. Therefore, I compute a measure that uses predictions with all negative values replaced by 0:
# Calculate new R-sq's with no negative predictions -> should have even higher accuracy.
rsq_nonegative <- function(model) {
actual_series <- model$model$Bikes_Counted
fitted_values_series <- model$fitted.values
fitted_values_series[fitted_values_series < 0] <- 0
RSS <- sum((actual_series - fitted_values_series) ** 2)
TSS <- sum((actual_series - mean(actual_series)) ** 2)
rsq_nonegativepreds <- 1 - RSS / TSS
return(rsq_nonegativepreds)
}
# Note: Here, I check models which saw the entire dataset
lm_exp |> rsq_nonegative()
## [1] 0.5793955
summary(lm_exp)$r.squared
## [1] 0.5599292
lm_exp_plus_interaction |> rsq_nonegative()
## [1] 0.7546476
summary(lm_exp_plus_interaction)$r.squared
## [1] 0.7445061
lm_exp_plus_interaction_functforms |> rsq_nonegative()
## [1] 0.7678233
summary(lm_exp_plus_interaction_functforms)$r.squared
## [1] 0.7572291
Now, I run an additional model: one that takes a selected number of time lags into account, in addition to all explanatory variables.
# Define own function to repeat for different models and number of timelags
lm_with_timelags <- function(dta, formula, timelags_to_add) {
timeseries <- dta$Bikes_Counted
for(nlag in 1:timelags_to_add) {
name <- paste("Bikes_Counted_L", nlag, sep = "")
raw_vec <- rep(NA, nrow(dta))
raw_vec[(nlag + 1) : (length(raw_vec))] <-
timeseries[1:(length(raw_vec) - nlag)]
dta[[name]] <- raw_vec
}
dta <- dta |> na.omit()
formula <- as.formula(formula)
lm_final <- lm(formula = formula, data = dta)
return(lm_final)
}
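To make the index shifting transparent, here is a toy illustration (made-up series, not the project data) of how the lag columns are constructed. Note that the lags are positional: across availability gaps, "lag 1" is the previous available hour rather than the previous clock hour.

```r
# Toy series: L1 holds the previous value, L2 the one before that
toy <- data.frame(Bikes_Counted = c(10, 20, 30, 40, 50))
for (nlag in 1:2) {
  toy[[paste0("Bikes_Counted_L", nlag)]] <-
    c(rep(NA, nlag), head(toy$Bikes_Counted, -nlag))
}
na.omit(toy)  # the first 2 rows are dropped, exactly as in lm_with_timelags()
```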
lm_exp_tl_1 <- lm_with_timelags(dta = data_final_aggr,
formula = Bikes_Counted ~ . - Date_Hour,
timelags_to_add = 1)
lm_exp_tl_6 <- lm_with_timelags(dta = data_final_aggr,
formula = Bikes_Counted ~ . - Date_Hour,
timelags_to_add = 6)
lm_exp_plus_interaction_tl_1 <- lm_with_timelags(dta = data_final_aggr,
formula = lm_exp_plus_interaction_formula,
timelags_to_add = 1)
lm_exp_plus_interaction_tl_6 <- lm_with_timelags(dta = data_final_aggr,
formula = lm_exp_plus_interaction_formula,
timelags_to_add = 6)
lm_exp_plus_interaction_functforms_tl_1 <- lm_with_timelags(dta = data_final_aggr,
formula = lm_exp_plus_interaction_functforms_formula,
timelags_to_add = 1)
lm_exp_plus_interaction_functforms_tl_6 <- lm_with_timelags(dta = data_final_aggr,
formula = lm_exp_plus_interaction_functforms_formula,
timelags_to_add = 6)
# Make use of own pre-defined function for R² without negative predictions
lm_exp_tl_1 |> rsq_nonegative()
## [1] 0.9206689
lm_exp_tl_6 |> rsq_nonegative()
## [1] 0.9452319
lm_exp_plus_interaction_tl_1 |> rsq_nonegative()
## [1] 0.944513
lm_exp_plus_interaction_tl_6 |> rsq_nonegative()
## [1] 0.955461
lm_exp_plus_interaction_functforms_tl_1 |> rsq_nonegative()
## [1] 0.9446086
lm_exp_plus_interaction_functforms_tl_6 |> rsq_nonegative()
## [1] 0.9556802
# Predictions of lm_exp_tl_6 - its R² is only roughly 1 percentage point worse than
# the model with all interaction terms, despite far lower complexity
# Prediction frame for visualization
lm_exp_tl_6_preds <- data.frame(Date_Time = lm_exp_tl_6$model$Date_Hour,
Actual = lm_exp_tl_6$model$Bikes_Counted,
Predicted = lm_exp_tl_6$fitted.values)
lm_exp_tl_6_preds$Predicted[lm_exp_tl_6_preds$Predicted < 0] <- 0
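The clipping step above can equivalently be written with pmax(), which some readers may find clearer:

```r
# Floor negative predictions at zero (toy values)
preds <- c(-3.2, 0, 12.5)
pmax(preds, 0)  # c(0.0, 0.0, 12.5)
```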
Clearly, the strongest gain in explanatory power comes from the time lags. This means that if we were to forecast the next hour, our performance would be very good in most cases.
Again, I sample some parts of the predicted data and compare them with the actual values:
n_sample_periods <- 20
length_sample_periods <- 100
for (i in 1:n_sample_periods) {
  subsample_startrow <- sample(1:(nrow(lm_exp_tl_6_preds) - length_sample_periods), 1)
  subsample_endrow <- subsample_startrow + length_sample_periods
  plotsample <- lm_exp_tl_6_preds[subsample_startrow:subsample_endrow, ]
  plot <- ggplot(data = plotsample, aes(x = Date_Time)) +
    geom_line(aes(y = Actual, color = "Actual")) +
    geom_line(aes(y = Predicted, color = "Simple LM + 6 Timelags")) +
    scale_color_manual(values = c("Actual" = prettypal[[2]],
                                  "Simple LM + 6 Timelags" = prettypal[[4]])) +
    ggtitle("Model with timelags: Prediction accuracy")
  print(plot)
}
There is a great fit for the very basic lm when 6 lags (= 6 previous hours) are added.
…with aggregated meteorological data of the day before.
If there is a potential practical application for such a project, e.g. traffic management or gastronomic planning, then users might want to know already at the beginning of the day what the traffic will look like across the whole day.
For that purpose, I aggregate means/sums/maximums of the meteorological data from the day before and combine them with ex-ante known criteria such as the weekday, the month, and the hour.
# Need mean (formula first, then data: the formula method of aggregate())
daily_means <- aggregate(cbind(visibility, air_pressure, rel_humidity,
                               air_temp, wind_speed, cloud_cover, snow_cover_cm) ~
                           substr(Date_Hour, 1, 10),
                         data = data_final_aggr,
                         FUN = mean)
# Need sum
daily_sum <- aggregate(cbind(Bikes_Counted, sunshine_duration,
                             precip_h_mm) ~
                         substr(Date_Hour, 1, 10),
                       data = data_final_aggr,
                       FUN = sum)
# Need max
daily_max <- aggregate(cbind(wind_highest_last_h) ~
                         substr(Date_Hour, 1, 10),
                       data = data_final_aggr,
                       FUN = max)
colnames(daily_max)[1] <- "Date"
colnames(daily_means)[1] <- "Date"
colnames(daily_sum)[1] <- "Date"
# Give meaningful prefixes
colnames(daily_max)[2:ncol(daily_max)] <- paste0("max_", colnames(daily_max)[2:ncol(daily_max)])
colnames(daily_means)[2:ncol(daily_means)] <- paste0("mean_", colnames(daily_means)[2:ncol(daily_means)])
colnames(daily_sum)[2:ncol(daily_sum)] <- paste0("sum_", colnames(daily_sum)[2:ncol(daily_sum)])
# Variables for which I can know the values ex-ante:
data_nolag <- data_final_aggr[, c("Date_Hour", "Hour", "Weekday", "Holiday",
"School_holidays", "Month", "Bikes_Counted")]
# Bring the different metrics together
data_daily <- merge(daily_sum, merge(daily_max, daily_means, by = "Date"), by = "Date")
data_daily$Date <- data_daily$Date |> as.Date()
# Get the day for which the prediction is useful: The next day.
# This will be the day where I merge it
data_daily$onedaylater <- data_daily$Date + 1
data_nolag$Date <- data_nolag$Date_Hour |> as.Date()
# 1db = one day before. Merging brings together ex-ante known facts about the day plus
# the meteorological aggregated data of the day before
data_with_vals_1db <- merge(data_nolag, data_daily, by.x = "Date", by.y = "onedaylater",
all.x = TRUE)
data_with_vals_1db <- data_with_vals_1db[order(data_with_vals_1db$Date_Hour),]
# Note: data from the previous day is used from 1 AM onwards
data_with_vals_1db <- data_with_vals_1db[data_with_vals_1db$Date_Hour >= "2016-01-27",]
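A minimal sketch of the "shift by one day, then merge" idea used above, on toy dates and hypothetical values:

```r
daily <- data.frame(Date = as.Date(c("2016-01-01", "2016-01-02")),
                    mean_temp = c(2.5, 4.0))  # hypothetical daily aggregates
daily$onedaylater <- daily$Date + 1
hourly <- data.frame(Date = as.Date(c("2016-01-02", "2016-01-03")),
                     Bikes_Counted = c(100, 150))  # hypothetical targets
# Each day is matched with the aggregates of its predecessor
merged <- merge(hourly, daily, by.x = "Date", by.y = "onedaylater", all.x = TRUE)
merged$mean_temp  # c(2.5, 4.0): 2016-01-02 carries the value of 2016-01-01
```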
lm(data_with_vals_1db, formula = Bikes_Counted ~ . - Date - Date_Hour - `Date.y`) |>
summary()
##
## Call:
## lm(formula = Bikes_Counted ~ . - Date - Date_Hour - Date.y, data = data_with_vals_1db)
##
## Residuals:
## Min 1Q Median 3Q Max
## -253.92 -42.46 -6.85 28.62 1142.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.711e+02 5.011e+01 -19.382 < 2e-16 ***
## Hour2 4.742e-01 2.389e+00 0.199 0.842652
## Hour3 3.102e+00 2.389e+00 1.298 0.194201
## Hour4 1.330e+01 2.390e+00 5.565 2.64e-08 ***
## Hour5 3.144e+01 2.391e+00 13.150 < 2e-16 ***
## Hour6 4.135e+01 2.391e+00 17.295 < 2e-16 ***
## Hour7 4.828e+01 2.391e+00 20.198 < 2e-16 ***
## Hour8 7.275e+01 2.390e+00 30.444 < 2e-16 ***
## Hour9 1.072e+02 2.389e+00 44.868 < 2e-16 ***
## Hour10 1.285e+02 2.389e+00 53.808 < 2e-16 ***
## Hour11 1.401e+02 2.389e+00 58.638 < 2e-16 ***
## Hour12 1.556e+02 2.389e+00 65.127 < 2e-16 ***
## Hour13 1.641e+02 2.389e+00 68.677 < 2e-16 ***
## Hour14 1.606e+02 2.389e+00 67.239 < 2e-16 ***
## Hour15 1.455e+02 2.390e+00 60.876 < 2e-16 ***
## Hour16 1.129e+02 2.390e+00 47.237 < 2e-16 ***
## Hour17 7.551e+01 2.390e+00 31.599 < 2e-16 ***
## Hour18 4.184e+01 2.389e+00 17.513 < 2e-16 ***
## Hour19 1.954e+01 2.389e+00 8.177 2.97e-16 ***
## Hour20 7.936e+00 2.390e+00 3.321 0.000897 ***
## Hour21 3.542e+00 2.387e+00 1.484 0.137786
## Hour22 1.994e+00 2.387e+00 0.835 0.403734
## Hour23 9.400e-01 2.389e+00 0.393 0.693956
## WeekdayMon -1.347e+01 1.444e+00 -9.328 < 2e-16 ***
## WeekdaySat 2.323e+01 1.319e+00 17.610 < 2e-16 ***
## WeekdaySun 5.612e+01 1.333e+00 42.115 < 2e-16 ***
## WeekdayThu 3.139e+00 1.319e+00 2.379 0.017346 *
## WeekdayTue 7.890e-01 1.325e+00 0.595 0.551533
## WeekdayWed 5.806e+00 1.322e+00 4.392 1.12e-05 ***
## Holiday1 5.625e+01 3.100e+00 18.148 < 2e-16 ***
## School_holidays1 -1.004e+00 1.045e+00 -0.961 0.336539
## Month02 1.012e+01 2.004e+00 5.049 4.47e-07 ***
## Month03 2.890e+01 2.036e+00 14.194 < 2e-16 ***
## Month04 4.202e+01 2.115e+00 19.870 < 2e-16 ***
## Month05 4.527e+01 2.229e+00 20.311 < 2e-16 ***
## Month06 3.814e+01 2.481e+00 15.373 < 2e-16 ***
## Month07 4.047e+01 2.549e+00 15.877 < 2e-16 ***
## Month08 3.785e+01 2.551e+00 14.838 < 2e-16 ***
## Month09 3.069e+01 2.330e+00 13.173 < 2e-16 ***
## Month10 1.729e+01 2.074e+00 8.335 < 2e-16 ***
## Month11 9.490e+00 1.951e+00 4.863 1.16e-06 ***
## Month12 3.003e+00 1.938e+00 1.550 0.121264
## sum_Bikes_Counted 1.112e-02 4.608e-04 24.132 < 2e-16 ***
## sum_sunshine_duration -2.121e-02 3.307e-03 -6.412 1.45e-10 ***
## sum_precip_h_mm 2.292e-01 1.041e-01 2.201 0.027723 *
## max_wind_highest_last_h -5.564e-01 1.809e-01 -3.075 0.002106 **
## mean_visibility 1.393e-01 3.771e-02 3.694 0.000221 ***
## mean_air_pressure 9.077e-01 4.775e-02 19.010 < 2e-16 ***
## mean_rel_humidity 2.599e-01 5.788e-02 4.490 7.15e-06 ***
## mean_air_temp 1.864e+00 1.198e-01 15.559 < 2e-16 ***
## mean_wind_speed -2.615e-01 4.200e-01 -0.623 0.533503
## mean_cloud_cover -4.764e+00 3.280e-01 -14.523 < 2e-16 ***
## mean_snow_cover_cm 5.535e-01 7.002e-01 0.791 0.429194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82.12 on 54250 degrees of freedom
## (192 observations deleted due to missingness)
## Multiple R-squared: 0.4448, Adjusted R-squared: 0.4443
## F-statistic: 835.8 on 52 and 54250 DF, p-value: < 2.2e-16
The fit is only moderate compared with the previously fitted models. However, what happens if we account for interaction terms?
Again, I check the increase in model fit when the interaction term is included in addition to the two individual variables.
# Improve performance
checkgrid <- t(combn(colnames(data_with_vals_1db), 2)) |> as.data.frame()
colnames(checkgrid) <- c("var_a", "var_b")
checkgrid$r_sq_increase <- NA
# Don't need to check explained variable
checkgrid <- checkgrid[-which(checkgrid$var_a == "Bikes_Counted" |
checkgrid$var_b == "Bikes_Counted"),]
# Don't want to use date variables
checkgrid <- checkgrid[-which(grepl("Date", checkgrid$var_a) |
grepl("Date", checkgrid$var_b)),]
for (i in 1:nrow(checkgrid)) {
  var1 <- checkgrid$var_a[i]
  var2 <- checkgrid$var_b[i]
  f1 <- paste0("Bikes_Counted ~ ", var1, " + ", var2) |> as.formula()
  f2 <- paste0("Bikes_Counted ~ ", var1, " + ", var2, " + ", var1, ":", var2) |> as.formula()
  m1 <- lm(data_with_vals_1db, formula = f1)
  m2 <- lm(data_with_vals_1db, formula = f2)
  r_sq_increase <- summary(m2)$r.squared - summary(m1)$r.squared
  checkgrid$r_sq_increase[i] <- r_sq_increase
}
# Keep only interactions which increase R² by more than 0.01 (one percentage point)
View(checkgrid[order(checkgrid$r_sq_increase, decreasing = TRUE),])
add_interactions_combis <- checkgrid[which(checkgrid$r_sq_increase > 0.01),]
added_interactions <- paste(add_interactions_combis$var_a, add_interactions_combis$var_b, sep = ":") |>
paste(collapse = " + ")
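For illustration, the two paste() calls above turn variable pairs into a formula fragment; with hypothetical variable names:

```r
var_a <- c("air_temp", "Weekday")
var_b <- c("Hour", "Holiday")
paste(var_a, var_b, sep = ":") |> paste(collapse = " + ")
# "air_temp:Hour + Weekday:Holiday"
```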
# Train model only on these interactions
lm_dailylag_interactions <- lm(data_with_vals_1db,
formula = as.formula(
paste("Bikes_Counted ~ . - Date - Date_Hour - `Date.y`",
added_interactions, sep = " + ")))
lm_dailylag_interactions |> rsq_nonegative()
## [1] 0.6640124
lm_exp_plus_interaction |> rsq_nonegative()
## [1] 0.7546476
If we do not use the hourly weather data to predict the bike traffic for exactly the same hour, and instead rely on the means/maxima/sums of the day before plus ex-ante known characteristics of the point in time (e.g. holiday, weekday, …), then we are not even 10% worse than the model with hourly weather data. This is quite impressive, but no big surprise once we consider that a large share of the variance is already explained by seasonality and the time of day.
We reuse the rsq_nonegative() function that was defined for the earlier models.
# FUNCTION PREPARATION
## R²
# rsq_nonegative()
## RMSE
get_rmse <- function(model) {
  rmse <- sqrt(mean(model$residuals^2))
  return(rmse)
}
## MAE
get_mae <- function(model) {
  mae <- mean(abs(model$residuals))
  return(mae)
}
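A quick sanity check of the two helpers (redefined here so the snippet is self-contained): for any residual vector, RMSE >= MAE by Jensen's inequality.

```r
get_rmse <- function(model) sqrt(mean(model$residuals^2))
get_mae <- function(model) mean(abs(model$residuals))
toy <- list(residuals = c(-2, 1, 0, 3))  # hypothetical residuals
get_mae(toy)   # 1.5
get_rmse(toy)  # sqrt(3.5), about 1.87
```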
model_list <- c(
"lm_exp",
"lm_exp_plus_interaction",
"lm_exp_plus_interaction_functforms",
"lm_exp_tl_1",
"lm_exp_tl_6",
"lm_exp_plus_interaction_tl_1",
"lm_exp_plus_interaction_tl_6",
"lm_exp_plus_interaction_functforms_tl_1",
"lm_exp_plus_interaction_functforms_tl_6",
"lm_dailylag_interactions"
)
metrics_df <- data.frame()
for (m in model_list) {
  name <- m
  model <- get(m)
  thisMAE <- model |> get_mae()
  thisRMSE <- model |> get_rmse()
  thisR2 <- model |> rsq_nonegative()
  entry <- data.frame(Model = name, MAE = thisMAE, RMSE = thisRMSE, R2 = thisR2)
  metrics_df <- rbind(metrics_df, entry)
}
The time-lag models and the day-before-data model perform quite well, but for further comparison the three relevant models will be the baseline model and its two variations with added interaction terms and functional forms.
Note that the cross-validation results for those were already calculated:
cv_res_inspection <- cv_res_matrix |> apply(2, mean) |> matrix(nrow = 3) |> t() # get mean of all iterations
colnames(cv_res_inspection) <- c("MAE", "RMSE", "R2")
rownames(cv_res_inspection) <- paste(rep(c("LM", "LM + IA", "LM + IA + FF"), each = 2),
rep(c("Train", "Test"), 3), sep = ": ")
print(cv_res_inspection)
## MAE RMSE R2
## LM: Train 71.85036 38.03583 0.5779033
## LM: Test 69.43331 37.46385 0.5852821
## LM + IA: Train 54.90258 27.74394 0.7535397
## LM + IA: Test 53.26840 27.56767 0.7558257
## LM + IA + FF: Train 53.38733 27.03963 0.7669553
## LM + IA + FF: Test 51.87950 26.90295 0.7683806
Recall that the three relevant variations of the linear model are not prone to overfitting, given the results of the hold-out cross-validation with 5 iterations and an 80% train / 20% test split.
Prepare the necessary information for comparison with later models:
metrics_eriks_models <- metrics_df[
which(metrics_df$Model %in% c("lm_exp", "lm_exp_plus_interaction",
"lm_exp_plus_interaction_functforms")), ]
# metrics_eriks_models$MSE <- metrics_eriks_models$RMSE^2 # not used for final model comparison
newmodelnames <- c("lm_exp" = "LM Baseline",
"lm_exp_plus_interaction" = "LM + IA",
"lm_exp_plus_interaction_functforms" = "LM + IA + FF")
metrics_eriks_models$Model <- newmodelnames[metrics_eriks_models$Model]
print(metrics_eriks_models)
## Model MAE RMSE R2
## 1 LM Baseline 43.34079 73.00132 0.5793955
## 2 LM + IA 30.64337 55.62367 0.7546476
## 3 LM + IA + FF 30.02102 54.22102 0.7678233
library(rpart)
# Initial model
model <- rpart(Bikes_Counted ~ . - Date_Hour, data = train, method = "anova", cp = 0, maxdepth = 5)
# maxdepth = 5 keeps training and CV loss close without inflating both. It also keeps the cp plot visually readable
# 10-fold cross-validation default in rpart
plotcp(model)
best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
best_cp
## [1] 0
# Prune the tree
pruned_model <- prune(model, cp = best_cp)
# Predict on test set
pred <- predict(pruned_model, newdata = test)
# True values
actual <- test$Bikes_Counted
# Metrics
mae <- mean(abs(pred - actual))
rmse <- sqrt(mean((pred - actual)^2))
r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
# Output metrics
metrics_tree <- c(MAE = mae, RMSE = rmse, R2 = r2)
# Training MSE
train_pred <- predict(pruned_model, newdata = train)
train_actual <- train$Bikes_Counted
training_mse <- mean((train_actual - train_pred)^2)
# CV MSE (from xerror * RSS in cptable, divide by nrow(train))
cv_mse <- min(model$cptable[, "xerror"]) * mean((train_actual - mean(train_actual))^2)
# Output MSE losses
losses_tree <- c(Training_MSE = training_mse, CV_MSE = cv_mse)
losses_tree
## Training_MSE CV_MSE
## 3033.884 3951.999
Model insight: 10-fold cross-validation is the default in the rpart package, and the anova method uses MSE as the CV loss by default. The unconstrained model would overfit (CV error >> training error); limiting the tree depth to 5 levels kept the CV error roughly the same while reducing the gap, yielding a model that does not overfit.
Limiting the tree depth also makes the pruning parameter “best_cp” more stable and the cp plot readable. Without the depth limit, the plot was cluttered with far too many splits.
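To make the CV-MSE computation above concrete: xerror in rpart's cptable is scaled relative to the root-node error, so multiplying by the root-node MSE recovers an absolute CV MSE. A small check of that scaling convention on synthetic data (assuming the rpart package is available):

```r
library(rpart)
set.seed(7)
d <- data.frame(x = runif(200))
d$y <- 3 * d$x + rnorm(200, sd = 0.1)
fit <- rpart(y ~ x, data = d, method = "anova")
# The first cptable row is the root node; its relative error is exactly 1
fit$cptable[1, "rel error"]
root_mse <- mean((d$y - mean(d$y))^2)  # MSE of predicting the overall mean
cv_mse <- min(fit$cptable[, "xerror"]) * root_mse
```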
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
# Remove Date_Hour
train_data <- train[, !(names(train) %in% c("Date_Hour"))]
# Prepare predictors and target
x <- train_data[, -which(names(train_data) == "Bikes_Counted")]
y <- train_data$Bikes_Counted
# Tune mtry with OOB error
best_mtry <- tuneRF(
x = x,
y = y,
mtryStart = 5,
stepFactor = 1.5,
improve = 0.05,
ntreeTry = 100,
trace = TRUE,
plot = TRUE
)
## mtry = 5 OOB error = 2803.116
## Searching left ...
## mtry = 4 OOB error = 2950.968
## -0.05274581 0.05
## Searching right ...
## mtry = 7 OOB error = 2682.917
## 0.04288022 0.05
optimal_mtry <- best_mtry[which.min(best_mtry[, "OOBError"]), "mtry"]
# Set up 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# Train model with cross-validation using optimal mtry
set.seed(42)
model_cv <- caret::train(
Bikes_Counted ~ .,
data = train_data,
method = "rf", # random forest through caret, which handles cross-validation
trControl = ctrl,
tuneGrid = data.frame(mtry = optimal_mtry),
ntree = 500,
nodesize = 100,
maxnodes = 10
)
# Extract CV metrics
cv_mse <- model_cv$results$RMSE^2
# Final model from caret (trained on full train data)
model <- model_cv$finalModel
# Training predictions and MSE
train_pred <- predict(model_cv, newdata = train_data)
training_mse <- mean((train_data$Bikes_Counted - train_pred)^2)
# Prepare test data (remove Date_Hour, Bikes_Counted)
test_data <- test[, !(names(test) %in% c("Date_Hour", "Bikes_Counted"))]
# Test predictions and metrics
pred <- predict(model_cv, newdata = test_data)
actual <- test$Bikes_Counted
mae <- mean(abs(pred - actual))
rmse <- sqrt(mean((pred - actual)^2))
r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
metrics_ens1 <- c(MAE = mae, RMSE = rmse, R2 = r2)
losses_rf <- c(Training_MSE = training_mse, CV_MSE = cv_mse)
losses_rf
## Training_MSE CV_MSE
## 7117.411 7140.580
Model insight: tuneRF gives different results when a small number of trees is used. It was noticed that the higher the mtry parameter, the better the model performed. This was a clue to use more trees in tuneRF, which led to a higher, more optimal mtry hyperparameter. Node size and max nodes were tweaked to reduce the gap between CV error and training error. In the previous tree model the tree depth could be controlled, which helped against overfitting; however, that hyperparameter cannot be changed in this particular package.
The randomForest package does not provide cross-validation itself. However, caret can implement it with intuitive notation and the same hyperparameter names as the randomForest package.
Note: gbm package does parallel compute (all CPU cores) - can go heavy on the hyperparameters
# install.packages("gbm")
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
train_data <- train[, !(names(train) %in% c("Date_Hour"))]
test_data <- test[, !(names(test) %in% c("Date_Hour", "Bikes_Counted"))]
set.seed(123)
model_gbm <- gbm(
Bikes_Counted ~ .,
data = train_data,
distribution = "gaussian",
n.trees = 400, # See the loss plot at the end: after 400 trees there is little marginal progress
interaction.depth = 2,
shrinkage = 0.05,
cv.folds = 20
)
best_iter <- gbm.perf(model_gbm, method = "cv", plot.it = FALSE) # Plot is made later manually with legend
pred <- predict(model_gbm, newdata = test_data, n.trees = best_iter)
actual <- test$Bikes_Counted
mae <- mean(abs(pred - actual))
rmse <- sqrt(mean((pred - actual)^2))
r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
metrics_ens2 <- c(MAE = mae, RMSE = rmse, R2 = r2)
# Training loss (MSE) at each iteration
train_loss <- model_gbm$train.error
# Validation loss (CV MSE) at each iteration
val_loss <- model_gbm$cv.error
# Plot both losses
plot(train_loss, type = "l", col = "grey", lwd = 2, ylab = "MSE", xlab = "Number of Trees")
lines(val_loss, col = "orange", lwd = 2)
legend("topright", legend = c("Training MSE", "CV MSE"), col = c("grey", "orange"), lwd = 2)
Model insight:
Many hyperparameters were tweaked; the most influential was the interaction depth. Keeping it low reduces the complexity of the model and makes training more stable. The plot shows that after 400 trees there is no further gain in CV error, while the model starts to overfit.
library(keras)
## Warning: package 'keras' was built under R version 4.4.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(recipes)
## Warning: package 'recipes' was built under R version 4.4.3
##
## Attaching package: 'recipes'
## The following object is masked from 'package:stats':
##
## step
library(tensorflow)
## Warning: package 'tensorflow' was built under R version 4.4.3
##
## Attaching package: 'tensorflow'
## The following object is masked from 'package:caret':
##
## train
library(reticulate)
## Warning: package 'reticulate' was built under R version 4.4.3
# Remove unused columns
train_data <- train %>% select(-Date_Hour)
test_data <- test %>% select(-Date_Hour, -Bikes_Counted)
# Define preprocessing using `recipes`
rec <- recipe(Bikes_Counted ~ ., data = train_data) %>%
step_dummy(all_nominal_predictors()) %>% # one-hot encode categoricals
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors()) %>%
prep()
## Warning: ! The following column has zero variance so scaling cannot be used:
## precip_type_X2.
## ℹ Consider using ?step_zv (`?recipes::step_zv()`) to remove those columns
## before normalizing.
# Apply preprocessing
x_train <- bake(rec, new_data = train_data) %>% select(-Bikes_Counted) %>% as.matrix()
x_test <- bake(rec, new_data = test_data) %>% as.matrix()
y_train <- train_data$Bikes_Counted
y_test <- test$Bikes_Counted
# Build Keras model
model <- keras_model_sequential() %>%
layer_dense(units = 32, activation = "relu", input_shape = ncol(x_train)) %>%
layer_dropout(rate = 0.2) %>% # The key to reduce overfitting
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1)
model %>% compile(
loss = "mse",
optimizer = optimizer_adam(),
metrics = list("mae")
)
###############################
##### Other models tried: #####
###############################
## Original model: No drop out
# model <- keras_model_sequential() %>%
# layer_dense(units = 32, activation = "relu", input_shape = ncol(x_train)) %>%
# layer_dense(units = 16, activation = "relu") %>%
# layer_dense(units = 1)
# Bad metrics when comparing cross validation to training error
## Model with wider layers
# model <- keras_model_sequential() %>%
# layer_dense(units = 64, activation = "relu", input_shape = ncol(x_train)) %>%
# layer_dense(units = 32, activation = "relu") %>%
# layer_dense(units = 1)
# Needlessly complex, had similar results in CV error compared to the less wide model
# the dropout layer helped the most
# Train model
history <- model %>% fit(
x_train, y_train,
epochs = 50, # after which the model starts to overfit, with no extra gain in validation loss
batch_size = 32,
validation_split = 0.2,
verbose = 0
)
plot(history) # Avoids per-epoch output, but keeps loss plot
# Predict + evaluate
pred <- model %>% predict(x_test) %>% as.vector()
## 35/35 - 0s - 45ms/epoch - 1ms/step
mae <- mean(abs(pred - y_test))
rmse <- sqrt(mean((pred - y_test)^2))
r2 <- 1 - sum((y_test - pred)^2) / sum((y_test - mean(y_test))^2)
metrics_nn <- c(MAE = mae, RMSE = rmse, R2 = r2)
Model insight: Across multiple layer configurations, the models always learned very fast up to about 30 epochs, but then had major overfitting problems. Adding cross-validation folds did not have as much effect as in the other models. Therefore a dropout layer was tested, which helped a lot. Interestingly, it is only useful between the two wide layers and not effective as the second-to-last layer. This is likely because overfitting takes place in the earlier, wider layers; the weights between the last layers are sparser and harder to influence through dropout.
# Combine all metric vectors into a list
metrics_list <- list(
tree = metrics_tree,
ens1 = metrics_ens1,
ens2 = metrics_ens2,
nn = metrics_nn
)
# Convert to data frame
metrics_df <- do.call(rbind, metrics_list)
# Add model names as a column
metrics_df <- data.frame(Model = rownames(metrics_df), metrics_df, row.names = NULL)
# Ensure column names are the same
colnames(metrics_eriks_models) <- colnames(metrics_df)
# Bind the data frames (base R; dplyr::bind_rows(metrics_eriks_models, metrics_df) would work equally)
combined_metrics <- rbind(metrics_eriks_models, metrics_df)
# Optional: pretty table
library(knitr)
kable(combined_metrics, digits = 3, caption = "Model Performance Metrics")
| Model | MAE | RMSE | R2 |
|---|---|---|---|
| LM Baseline | 43.341 | 73.001 | 0.579 |
| LM + IA | 30.643 | 55.624 | 0.755 |
| LM + IA + FF | 30.021 | 54.221 | 0.768 |
| tree | 29.080 | 60.502 | 0.702 |
| ens1 | 45.999 | 84.918 | 0.413 |
| ens2 | 26.719 | 51.144 | 0.787 |
| nn | 26.205 | 52.605 | 0.775 |
# Drop non-numeric / ID columns from training data
df_base <- df[, !(names(df) %in% c("Date_Hour"))]
# Define categorical variables to one-hot encode
categorical_vars <- c("precip_type", "cloud_cover", "Hour", "Weekday", "Holiday",
"School_holidays", "wind_dir", "Month")
# One-hot encode categorical variables using model.matrix (no intercept)
dummies <- model.matrix(~ . -1, data = df_base[, categorical_vars])
# Extract numeric variables
numeric_vars <- df_base[, !(names(df_base) %in% categorical_vars)]
# Combine numeric variables and dummy variables
df_encoded <- cbind(numeric_vars, dummies)
# Remove rows with NA in training set
clean_df <- na.omit(df_encoded)
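For intuition, model.matrix(~ . - 1, …) expands a factor into one indicator column per level (toy example; note that when several factors appear on the right-hand side, only the first keeps all of its levels under the default contrasts):

```r
toy <- data.frame(Weekday = factor(c("Mon", "Sat", "Sun")))
m <- model.matrix(~ . - 1, data = toy)
colnames(m)  # "WeekdayMon" "WeekdaySat" "WeekdaySun"
rowSums(m)   # every row has exactly one 1
```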
# Run PCA on scaled data
pca <- prcomp(clean_df, scale. = TRUE)
# Summary and scree plot
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0718 1.76451 1.40242 1.29641 1.24209 1.21056 1.18307
## Proportion of Variance 0.0588 0.04265 0.02694 0.02302 0.02113 0.02007 0.01917
## Cumulative Proportion 0.0588 0.10145 0.12839 0.15142 0.17255 0.19262 0.21180
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.17237 1.13945 1.13051 1.12211 1.11865 1.11101 1.10927
## Proportion of Variance 0.01883 0.01779 0.01751 0.01725 0.01714 0.01691 0.01686
## Cumulative Proportion 0.23063 0.24841 0.26592 0.28317 0.30031 0.31722 0.33407
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.09851 1.09644 1.08699 1.08279 1.07558 1.07357 1.06638
## Proportion of Variance 0.01653 0.01647 0.01619 0.01606 0.01585 0.01579 0.01558
## Cumulative Proportion 0.35060 0.36707 0.38326 0.39932 0.41517 0.43096 0.44653
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 1.06220 1.05829 1.05591 1.05314 1.04543 1.04346 1.04077
## Proportion of Variance 0.01546 0.01534 0.01527 0.01519 0.01497 0.01492 0.01484
## Cumulative Proportion 0.46199 0.47733 0.49260 0.50780 0.52277 0.53768 0.55252
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 1.0395 1.0289 1.02843 1.02735 1.0253 1.02348 1.02147
## Proportion of Variance 0.0148 0.0145 0.01449 0.01446 0.0144 0.01435 0.01429
## Cumulative Proportion 0.5673 0.5818 0.59632 0.61077 0.6252 0.63952 0.65382
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 1.01941 1.01679 1.01476 1.01155 1.00691 1.00612 1.00252
## Proportion of Variance 0.01424 0.01416 0.01411 0.01402 0.01389 0.01387 0.01377
## Cumulative Proportion 0.66805 0.68221 0.69632 0.71034 0.72423 0.73809 0.75186
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.99963 0.99762 0.99556 0.98849 0.98432 0.98018 0.97532
## Proportion of Variance 0.01369 0.01363 0.01358 0.01339 0.01327 0.01316 0.01303
## Cumulative Proportion 0.76555 0.77918 0.79276 0.80614 0.81942 0.83258 0.84561
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.97333 0.96405 0.9629 0.96001 0.93559 0.91099 0.89656
## Proportion of Variance 0.01298 0.01273 0.0127 0.01262 0.01199 0.01137 0.01101
## Cumulative Proportion 0.85859 0.87132 0.8840 0.89664 0.90863 0.92000 0.93101
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.88218 0.84728 0.83169 0.77519 0.69297 0.63749 0.59342
## Proportion of Variance 0.01066 0.00983 0.00948 0.00823 0.00658 0.00557 0.00482
## Cumulative Proportion 0.94168 0.95151 0.96098 0.96922 0.97579 0.98136 0.98619
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.55112 0.46071 0.39800 0.36394 0.25032 0.23827 0.20367
## Proportion of Variance 0.00416 0.00291 0.00217 0.00181 0.00086 0.00078 0.00057
## Cumulative Proportion 0.99035 0.99325 0.99542 0.99724 0.99810 0.99887 0.99944
## PC71 PC72 PC73
## Standard deviation 0.20172 2.584e-15 2.325e-15
## Proportion of Variance 0.00056 0.000e+00 0.000e+00
## Cumulative Proportion 1.00000 1.000e+00 1.000e+00
plot(pca, type = "l")
# Choose first 3 PCs
train_pca <- as.data.frame(pca$x[, 1:3])
# Add response variable (assuming Bikes_Counted is in clean_df or df)
train_pca$Bikes_Counted <- clean_df$Bikes_Counted
# Prepare test set:
# Select same variables (numeric + dummies) from test, except Date_Hour
test_base <- test[, !(names(test) %in% c("Date_Hour"))]
# One-hot encode categorical vars in test with model.matrix
test_dummies <- model.matrix(~ . -1, data = test_base[, categorical_vars])
# Numeric vars in test
test_numeric <- test_base[, !(names(test_base) %in% categorical_vars)]
# Combine test numeric + dummies
test_encoded <- cbind(test_numeric, test_dummies)
# Make sure test_encoded has the same columns as clean_df (train encoded)
# If test has missing columns, add them with zeros
missing_cols <- setdiff(names(clean_df), names(test_encoded))
for (col in missing_cols) {
  test_encoded[, col] <- 0
}
# Drop extra columns in test not in train
extra_cols <- setdiff(names(test_encoded), names(clean_df))
if(length(extra_cols) > 0) test_encoded <- test_encoded[, !(names(test_encoded) %in% extra_cols)]
# Order columns to match train
test_encoded <- test_encoded[, names(clean_df)]
# Remove rows with NA in test (optional)
test_encoded <- na.omit(test_encoded)
# Scale test set using train PCA parameters
test_scaled <- scale(test_encoded, center = pca$center, scale = pca$scale)
# Apply PCA rotation on test set (first 3 PCs)
test_pca <- as.data.frame(test_scaled %*% pca$rotation[, 1:3])
# Add response variable to test PCA dataframe
test_pca$Bikes_Counted <- test$Bikes_Counted
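The manual scale-then-rotate projection used for the test set should agree with predict() on the prcomp object; a toy check under that assumption:

```r
set.seed(1)
train_m <- matrix(rnorm(40), nrow = 10)  # hypothetical training matrix
new_m <- matrix(rnorm(8), nrow = 2)      # hypothetical new observations
pca_toy <- prcomp(train_m, scale. = TRUE)
manual <- scale(new_m, center = pca_toy$center, scale = pca_toy$scale) %*% pca_toy$rotation
builtin <- predict(pca_toy, newdata = new_m)
all.equal(manual, builtin, check.attributes = FALSE)  # TRUE
```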
At first, only numerical variables were used in the PCA; after scaling, 3 PCs were chosen. However, working with the data and realising that the categorical variables are very influential in modelling the bike counts, more research was done to get those variables involved in the PCA as well.
All categorical variables were one-hot encoded as dummies for each category, so that they can now represent their dimension in the PCA if they are influential enough. The scree plot after adding all one-hot encoded categorical variables still indicated that 3 PCs capture variance most efficiently, but the whole curve became smoother, indicating that the categorical variables did have an effect. Inspecting the individual dummies’ factor loadings would also confirm this; however, there are many of them.
3 PCs will be used. Beyond that, the marginal explanatory power of the components drops drastically (see plot).
The remaining models are copies of the initial ones, applied to the 3 PCs. Interestingly, these take less time to run due to the large decrease in data volume to be processed.
Repeat the same procedure as for other lms: 5 iterations of hold-out cross-validation with 80% Training - 20% Testing split:
### Conduct CV analogously to the other linear models
set.seed(0345)
iter_cv <- 5
pca_cv_res_matrix <- matrix(NA, nrow = iter_cv, ncol = (1 * 2 * 3)) # 1 model - 2 splits - 3 metrics
models <- c("LM (PCA)")
splits <- c("Train", "Test")
metrics <- c("MAE","RMSE", "R2")
colnames(pca_cv_res_matrix) <- paste(rep(models, each = 6),
rep(rep(splits, each = 3)),
rep(metrics, 2),
sep = ", ")
# For hold-out CV: Need entire PC dataset!
whole_pca_dataset <- rbind(train_pca, test_pca)
whole_pca_dataset <- whole_pca_dataset[order(rownames(whole_pca_dataset) |> as.numeric()),]
# Store cross-validation RMSE
for (j in 1:iter_cv) {
  # Hold-out CV: always take 80% for training and 20% for testing
  train_sample <- sample(1:nrow(whole_pca_dataset), floor(0.8 * nrow(whole_pca_dataset)))
  data_train <- whole_pca_dataset[train_sample, ]
  data_test <- whole_pca_dataset[-train_sample, ]
  # Train lm and measure performance on the training set
  lm0 <- lm(formula = Bikes_Counted ~ ., data = data_train)
  train_pred_0 <- predict(lm0, data_train) |> no_negative_predictions()
  # Predict on the testing set as well
  test_pred_0 <- predict(lm0, data_test) |> no_negative_predictions()
  pca_cv_res_matrix[j, "LM (PCA), Train, MAE"] <- calc_mae(data_train$Bikes_Counted, train_pred_0)
  pca_cv_res_matrix[j, "LM (PCA), Train, RMSE"] <- calc_rmse(data_train$Bikes_Counted, train_pred_0)
  pca_cv_res_matrix[j, "LM (PCA), Train, R2"] <- calc_rsq(data_train$Bikes_Counted, train_pred_0)
  pca_cv_res_matrix[j, "LM (PCA), Test, MAE"] <- calc_mae(data_test$Bikes_Counted, test_pred_0)
  pca_cv_res_matrix[j, "LM (PCA), Test, RMSE"] <- calc_rmse(data_test$Bikes_Counted, test_pred_0)
  pca_cv_res_matrix[j, "LM (PCA), Test, R2"] <- calc_rsq(data_test$Bikes_Counted, test_pred_0)
}
print(pca_cv_res_matrix |> as.data.frame())
## LM (PCA), Train, MAE LM (PCA), Train, RMSE LM (PCA), Train, R2
## 1 42.81623 75.47941 0.5432904
## 2 42.81649 74.64720 0.5535519
## 3 42.51508 74.09849 0.5488918
## 4 43.89116 76.33699 0.5577187
## 5 42.39546 74.31666 0.5522816
## LM (PCA), Test, MAE LM (PCA), Test, RMSE LM (PCA), Test, R2
## 1 42.21231 68.20898 0.5840200
## 2 42.16285 71.11037 0.5470057
## 3 41.83485 73.80645 0.5607646
## 4 39.38429 62.28175 0.5344216
## 5 43.57027 72.34487 0.5544129
pca_mean_of_all_iters <- apply(pca_cv_res_matrix, 2, mean)
pca_cv_res_inspection <- pca_cv_res_matrix |> apply(2, mean) |> matrix(nrow = 3) |> t()
colnames(pca_cv_res_inspection) <- c("MAE", "RMSE", "R2")
rownames(pca_cv_res_inspection) <- paste(rep(c("LM (3 PCAs)"), each = 2),
c("Train", "Test"), sep = ": ")
# Bring it beautifully together
cv_res_inspection_final <- rbind(cv_res_inspection, pca_cv_res_inspection) |> data.frame()
Add the number of explanatory variables of each model:
cv_res_inspection_final$n_explanatories <- rep(
c(
lm_exp |> model.matrix() |> ncol(),
lm_exp_plus_interaction |> model.matrix() |> ncol(),
lm_exp_plus_interaction_functforms |> model.matrix() |> ncol(),
lm0 |> model.matrix() |> ncol() # lm0 still the PC lm from CV
),
each = 2
)
print(cv_res_inspection_final)
## MAE RMSE R2 n_explanatories
## LM: Train 38.03583 71.85036 0.5779033 71
## LM: Test 37.46385 69.43331 0.5852821 71
## LM + IA: Train 27.74394 54.90258 0.7535397 686
## LM + IA: Test 27.56767 53.26840 0.7558257 686
## LM + IA + FF: Train 27.03963 53.38733 0.7669553 691
## LM + IA + FF: Test 26.90295 51.87950 0.7683806 691
## LM (3 PCAs): Train 42.88688 74.97575 0.5511469 4
## LM (3 PCAs): Test 41.83292 69.55049 0.5561250 4
It can be seen that the model using only the three PCs (plus intercept)
as explanatory variables performs roughly at the same level as the baseline lm,
which uses about 17 variables. Due to factor variables,
n_explanatories is higher, because every possible
realization (except for one reference level) receives an individual estimate.
This is a good point in favor of the PCs. However, the lms
including interaction terms and other functional forms still
outperform the PCA model.
As we see and would have expected, overfitting is not a big
issue for an OLS model with > 50000 observations, even if
we effectively use several hundred explanatory variables once we
include interaction terms and functional forms.
library(rpart)
# Train initial tree model on PCA-transformed training data
model_pca <- rpart(Bikes_Counted ~ ., data = train_pca, method = "anova", cp = 0, maxdepth = 5)
# maxdepth = 5: keeps the training-CV loss gap small without increasing both losses
# Use cross-validation info to find optimal cp
best_cp_pca <- model_pca$cptable[which.min(model_pca$cptable[, "xerror"]), "CP"]
# Prune the tree
pruned_model_pca <- prune(model_pca, cp = best_cp_pca)
# Predict on PCA-transformed test data
pred_pca <- predict(pruned_model_pca, newdata = test_pca)
# True values
actual_pca <- test_pca$Bikes_Counted
# Metrics
mae_pca <- mean(abs(pred_pca - actual_pca))
rmse_pca <- sqrt(mean((pred_pca - actual_pca)^2))
r2_pca <- 1 - sum((actual_pca - pred_pca)^2) / sum((actual_pca - mean(actual_pca))^2)
metrics_tree_pca <- c(MAE = mae_pca, RMSE = rmse_pca, R2 = r2_pca)
# Training MSE
train_pred_pca <- predict(pruned_model_pca, newdata = train_pca)
train_actual_pca <- train_pca$Bikes_Counted
training_mse_pca <- mean((train_actual_pca - train_pred_pca)^2)
# CV MSE: rpart's xerror is relative to the root-node error, which for
# anova trees is approximately the variance of the response
cv_mse_pca <- min(model_pca$cptable[, "xerror"]) * var(train_actual_pca)
# Output MSE losses
losses_tree_pca <- c(Training_MSE = training_mse_pca, CV_MSE = cv_mse_pca)
losses_tree_pca
## Training_MSE CV_MSE
## 3148.354 4042.605
library(randomForest)
library(caret)
# Remove Date_Hour from PCA data if present
train_pca <- train_pca[, !(names(train_pca) %in% c("Date_Hour"))]
# Prepare predictors and target
x_pca <- train_pca[, -which(names(train_pca) == "Bikes_Counted")]
y_pca <- train_pca$Bikes_Counted
set.seed(42)
# Tune mtry with OOB error on PCA features
best_mtry_pca <- tuneRF(
x = x_pca,
y = y_pca,
mtryStart = 3,
stepFactor = 1.5,
improve = 0.05,
ntreeTry = 100,
trace = TRUE,
plot = TRUE
)
## mtry = 3 OOB error = 3550.145
## Searching left ...
## mtry = 2 OOB error = 3530.251
## 0.005603713 0.05
## Searching right ...
optimal_mtry <- best_mtry_pca[which.min(best_mtry_pca[, "OOBError"]), "mtry"]
# Setup 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10, savePredictions = "final")
set.seed(42)
# Train random forest with cross-validation
rf_pca_cv <- caret::train(
Bikes_Counted ~ .,
data = train_pca,
method = "rf",
trControl = ctrl,
tuneGrid = data.frame(mtry = optimal_mtry),
ntree = 500,
nodesize = 40,
maxnodes = 15,
importance = TRUE
)
# Extract CV metrics
cv_rmse <- rf_pca_cv$results$RMSE
cv_mse <- cv_rmse^2
# Final model fitted on full training data
rf_pca <- rf_pca_cv$finalModel
# Training set predictions and MSE
train_preds <- predict(rf_pca, newdata = train_pca)
training_mse_pca <- mean((train_pca$Bikes_Counted - train_preds)^2)
# Prepare test data (remove Date_Hour and Bikes_Counted)
test_pca_data <- test_pca[, !(names(test_pca) %in% c("Date_Hour", "Bikes_Counted"))]
# Test set predictions and metrics
preds <- predict(rf_pca, newdata = test_pca_data)
actuals <- test_pca$Bikes_Counted
mae <- mean(abs(preds - actuals))
rmse <- sqrt(mean((preds - actuals)^2))
r2 <- 1 - sum((actuals - preds)^2) / sum((actuals - mean(actuals))^2)
# Output metrics
metrics_ens1_pca <- c(MAE = mae, RMSE = rmse, R2 = r2)
losses_rf_pca <- c(Training_MSE = training_mse_pca, CV_MSE = cv_mse)
losses_rf_pca
## Training_MSE CV_MSE
## 3320.574 3614.888
library(gbm)
# Use PCA-transformed training and test data
train_data_pca <- train_pca
test_data_pca <- test_pca[, -which(names(test_pca) == "Bikes_Counted")]
set.seed(123)
model_gbm_pca <- gbm(
Bikes_Counted ~ .,
data = train_data_pca,
distribution = "gaussian",
n.trees = 200,
interaction.depth = 2,
shrinkage = 0.05,
cv.folds = 15
)
# Best number of trees based on CV
best_iter_pca <- gbm.perf(model_gbm_pca, method = "cv", plot.it = FALSE)
# Predict on test set
pred_pca <- predict(model_gbm_pca, newdata = test_data_pca, n.trees = best_iter_pca)
actual_pca <- test_pca$Bikes_Counted
# Evaluation metrics
mae_pca <- mean(abs(pred_pca - actual_pca))
rmse_pca <- sqrt(mean((pred_pca - actual_pca)^2))
r2_pca <- 1 - sum((actual_pca - pred_pca)^2) / sum((actual_pca - mean(actual_pca))^2)
# Store test metrics
metrics_ens2_pca <- c(MAE = mae_pca, RMSE = rmse_pca, R2 = r2_pca)
# Training and CV loss over iterations
train_loss_pca <- model_gbm_pca$train.error
val_loss_pca <- model_gbm_pca$cv.error
# Plot training and validation MSE
plot(train_loss_pca, type = "l", col = "grey", lwd = 2, ylab = "MSE", xlab = "Number of Trees")
lines(val_loss_pca, col = "orange", lwd = 2)
legend("topright", legend = c("Training MSE", "CV MSE"), col = c("grey", "orange"), lwd = 2)
# Store selected losses
losses_gbm_pca <- c(Training_MSE = train_loss_pca[best_iter_pca], CV_MSE = val_loss_pca[best_iter_pca])
library(keras)
# Prepare PCA-based training and test matrices
x_train_pca <- train_pca[, -which(names(train_pca) == "Bikes_Counted")] %>% as.matrix()
x_test_pca <- test_pca[, -which(names(test_pca) == "Bikes_Counted")] %>% as.matrix()
y_train_pca <- train_pca$Bikes_Counted
y_test_pca <- test_pca$Bikes_Counted
# Build Keras model: Same as the previous NN
model_pca <- keras_model_sequential() %>%
layer_dense(units = 32, activation = "relu", input_shape = ncol(x_train_pca)) %>%
layer_dropout(rate = 0.2) %>% # Dropout to mitigate overfitting
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1)
model_pca %>% compile(
loss = "mse",
optimizer = optimizer_adam(),
metrics = list("mae")
)
# Train model
history_pca <- model_pca %>% fit(
x_train_pca, y_train_pca,
epochs = 200,
batch_size = 16,
validation_split = 0.2,
verbose = 0
)
plot(history_pca)
# Predict and evaluate
pred_pca <- model_pca %>% predict(x_test_pca) %>% as.vector()
## 35/35 - 0s - 32ms/epoch - 914us/step
mae_pca <- mean(abs(pred_pca - y_test_pca))
rmse_pca <- sqrt(mean((pred_pca - y_test_pca)^2))
r2_pca <- 1 - sum((y_test_pca - pred_pca)^2) / sum((y_test_pca - mean(y_test_pca))^2)
metrics_nn_pca <- c(MAE = mae_pca, RMSE = rmse_pca, R2 = r2_pca)
# Combine all metric vectors into a list
metrics_list <- list(
tree = metrics_tree,
ens1 = metrics_ens1,
ens2 = metrics_ens2,
nn = metrics_nn,
tree_pca = metrics_tree_pca,
ens1_pca = metrics_ens1_pca,
ens2_pca = metrics_ens2_pca,
nn_pca = metrics_nn_pca
)
# Convert to data frame
metrics_df <- do.call(rbind, metrics_list)
# Add model names as a column
metrics_df <- data.frame(Model = rownames(metrics_df), metrics_df, row.names = NULL)
# Ensure column names are the same
colnames(metrics_eriks_models) <- colnames(metrics_df)
# Bind the data frames (dplyr::bind_rows(metrics_eriks_models, metrics_df)
# would work equally well here)
combined_metrics <- rbind(metrics_eriks_models, metrics_df)
# Adding lm_pca to metrics table
last_row <- cv_res_inspection_final[nrow(cv_res_inspection_final), ]
lm_pca_metrics <- data.frame(
Model = "LM (3 PCAs)",
MAE = last_row$MAE,
RMSE = last_row$RMSE,
R2 = last_row$R2
)
# Insert at position 8
combined_metrics <- rbind(
combined_metrics[1:7, ],
lm_pca_metrics,
combined_metrics[8:nrow(combined_metrics), ]
)
# Optional: pretty table
library(knitr)
kable(combined_metrics, digits = 3, caption = "Model Performance Metrics")
|    | Model        |    MAE |   RMSE |    R2 |
|----|--------------|--------|--------|-------|
| 1  | LM Baseline  | 43.341 | 73.001 | 0.579 |
| 2  | LM + IA      | 30.643 | 55.624 | 0.755 |
| 3  | LM + IA + FF | 30.021 | 54.221 | 0.768 |
| 4  | tree         | 29.080 | 60.502 | 0.702 |
| 5  | ens1         | 45.999 | 84.918 | 0.413 |
| 6  | ens2         | 26.719 | 51.144 | 0.787 |
| 7  | nn           | 26.205 | 52.605 | 0.775 |
| 8  | LM (3 PCAs)  | 41.833 | 69.550 | 0.556 |
| 9  | tree_pca     | 33.070 | 52.640 | 0.774 |
| 10 | ens1_pca     | 33.843 | 56.067 | 0.744 |
| 11 | ens2_pca     | 32.533 | 52.299 | 0.777 |
| 12 | nn_pca       | 33.849 | 54.529 | 0.758 |
Bringing it together, we can see that more complex models do not necessarily outperform the linear models when we train them in a way that avoids over-fitting: the linear model with interaction terms already explains roughly \(\frac{3}{4}\) of the variation in the data. The neural network, despite its higher complexity and being among the best performers of the non-linear models, is only moderately better.
The models with PCA perform slightly worse than their non-PCA counterparts. This is expected, since some of the variance in the full data is lost. These models do, however, perform very well considering the large reduction in the number of variables, which shows that PCA is a powerful tool for dimensionality reduction. The PCA-based models were also noticeably faster to run, and for even larger datasets PCA would save a lot of compute compared to working with the raw data.
If we were to develop a model for a use case exceeding the scope of the coursework, then time lags or aggregated data from the previous day could be interesting explanatory variables as well, as has been experimentally outlined in the lm section. This could be extended to the other models.
All in all, the most relevant determinants of bike traffic seem to be ex-ante known criteria, such as the weekday, the month (seasonality), whether it is a public holiday, and the hour of the day, which is consistent with what one would expect (especially given regular bike commuters). However, weather data helps to some extent to obtain more precise predictions.